This project is based on the Data Science Game 2017 hosted on Kaggle, a worldwide competition between universities. This year the data was provided by the music streaming service Deezer. Deezer offers a recommendation feature, called Flow, which suggests songs the user might like to listen to. The algorithm behind “Flow” uses collaborative filtering to provide the user with the right music at the right time. While the algorithm detects the user’s current listening taste by analysing the particular streaming session in which songs are listened to or skipped, the first song is the most difficult one to predict. Therefore, the goal of this challenge was to predict how likely it is that the users of the test dataset will listen to the first track of “Flow”.
At the end of the project we finished 66th out of 145 teams, with 45 submissions and a score of 0.63860. This report reflects our whole project journey, including our ideas, failures and achievements.
The report first gives a short introduction to the dataset. Subsequently, our feature engineering approaches are explained and the prediction models we used are presented.
The libraries we used for the project are the following:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr,
               xgboost,
               Matrix,
               data.table,
               caret,
               splitstackshape,
               ggplot2,
               psychometric,
               RColorBrewer,
               jsonlite,
               httr,
               corrplot,
               gridExtra,
               formattable,
               qdap,
               tm,
               textcat,
               rvest,
               stringr
)
This chapter presents an exploratory analysis of the competition dataset. As mentioned before, the goal of this challenge was to predict whether the first recommended song will be listened to or not. “Not listened” means that the song was skipped within the first 30 seconds, from which it is assumed that the user does not like the song.
The test dataset contains the first recommended “Flow” track for several different users. The train dataset was generated from each user’s listening history over one month, covering both songs played in “Flow” and songs played via searches or saved playlists. Each row represents one listened or skipped song.
Data:
Deezer <- read.csv("/home/Deezer/10_Basic_Dataset/train.csv")
Deezer_test <- read.csv("/home/Deezer/10_Basic_Dataset/test.csv")
# Combine train and test for joint preprocessing: drop the first (ID) column
# of the test set and add a dummy response so the column sets match
tmp <- Deezer_test[,-1]
tmp$is_listened <- 0
all <- rbind(Deezer,tmp)
rm(tmp)
The train data contains 7.56 million rows and 15 features, while the test set contains only 19.92 thousand rows.
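Since the response variable is binary, a quick look at its class balance helps put these numbers into perspective. A minimal sketch (the exact proportions are not reported here):

```r
# Share of skipped (0) vs. listened (1) tracks in the train set
round(prop.table(table(Deezer$is_listened)), 3)
```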
The dataset provides the following columns, which we grouped into user-specific, song-specific and device-specific features. These features were analyzed, NAs detected and handled appropriately. As all features are initially read in as numeric, we convert them into the right data types later on.
User specific: user_id, user_gender, user_age
Song specific: media_id, album_id, artist_id, genre_id, media_duration, release_date
Device specific: platform_family, platform_name
Other features: ts_listen, context_type, listen_type
Response variable: is_listened
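As a sketch of the type conversion mentioned above (the exact set of columns we convert is decided later; the selection below is an assumption), the categorical ID columns of the combined frame `all` can be turned into factors:

```r
# Convert categorical columns, read in as integers, to factors (assumed selection)
id_cols <- c("user_id", "media_id", "album_id", "artist_id", "genre_id",
             "platform_family", "platform_name", "context_type",
             "listen_type", "user_gender")
all[id_cols] <- lapply(all[id_cols], as.factor)
```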
The data contains 19918 different users. Some of them occur thousands of times, while nearly 1500 users occur just once (with a single listened or skipped song). No NAs were found.
user_distribution <- Deezer %>%
group_by(user_id) %>%
summarise(n=n()) %>%
arrange (-n)
myColors <- rep(brewer.pal(7,"Blues")[5:6],50)
ggplot(user_distribution, aes(x=n))+
geom_histogram(bins=100, fill=myColors,color="black")+
scale_x_continuous(breaks=seq(0,2000,50),limits=c(0,2000))+
labs(subtitle="in the train set",
y="number of users",
x="number of songs",
title="Distribution of Observations per User",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("observations_distribution.png",plot=gg6, width = 8, height=5, units="in")
The data reflects the listening behavior of male and female users. However, we do not know which gender the values 0 and 1 correspond to. The plot below illustrates the number of observations per gender.
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=user_gender))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of users",
x="Gender",
title="Distribution of Users Gender",
caption="Source: Deezer train data")+
theme_light()
The users’ ages range from 18 to 30, with a median of 25 across all observations. The following boxplot illustrates the distribution of the unique users by their age; accordingly, 50% of the users are 24 years old or younger. The feature age does not contain any NAs in the Deezer dataset.
sum(is.na(Deezer$user_age))
## [1] 0
age <- Deezer %>%
group_by(user_id) %>% summarise(age=max(user_age))
boxplot(age$age)
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],6),"#4292C6")
ggplot(age, aes(x=age))+stat_bin(binwidth=1, fill=myColors, color="black") + scale_y_continuous(limits=c(0,2200),breaks=seq(0,2200,100))+
stat_bin(binwidth=1, geom="text", aes(label=..count..), vjust=-1.5)+ scale_x_continuous(breaks=seq(18,30,1))+
labs(y="number of users",
x="age",
title="Distribution of Age among Users",
caption="Source: Deezer train data")+
theme_light()
#ggsave("Age_distribution.png",plot=gg7, width = 8, height=5, units="in")
The dataset contains 452975 different media_ids. One media_id represents one specific song by one artist on one specific album. At first, we thought that one media_id reflects one unique song. However, checking the provided extra information (extra_infos.json), which links each media_id to a song title, an album title and an artist name, shows that one unique song (e.g. Everybody by the Backstreet Boys) can have more than one media_id. This is caused by the album: whenever a song also appears on another album (e.g. Fetenkult - Best of the 90’s), a new media_id is generated.
One example is shown in the following data table:
##
Imported 453310 records. Simplifying...
extra_example <- extra %>%
filter(sng_title=="Everybody (Backstreet's Back)" & art_name=="Backstreet Boys")
data.table(extra_example)
## media_id sng_title
## 1: 70222244 Everybody (Backstreet's Back)
## 2: 4254236 Everybody (Backstreet's Back)
## 3: 15391051 Everybody (Backstreet's Back)
## 4: 834740 Everybody (Backstreet's Back)
## 5: 632276 Everybody (Backstreet's Back)
## 6: 14164455 Everybody (Backstreet's Back)
## 7: 605957 Everybody (Backstreet's Back)
## alb_title
## 1: The Essential Backstreet Boys
## 2: Essential Pop Anthems: Classic 80s, 90s and Current Chart Hits
## 3: Backstreet Boys
## 4: 新好男孩超級精選
## 5: Backstreet's Back
## 6: The Very Best Of
## 7: Fetenkult - Best Of The 90's
## art_name
## 1: Backstreet Boys
## 2: Backstreet Boys
## 3: Backstreet Boys
## 4: Backstreet Boys
## 5: Backstreet Boys
## 6: Backstreet Boys
## 7: Backstreet Boys
Furthermore, we can count 151471 different albums and 67142 different artists in the dataset. No NAs were identified.
sum(is.na(Deezer$media_id))
## [1] 0
sum(is.na(Deezer$album_id))
## [1] 0
sum(is.na(Deezer$artist_id))
## [1] 0
According to the extra information from the JSON file, we know that media_id and album_id are strongly correlated with each other: the correlation coefficient is 0.998958. The following histograms underline this correlation. Furthermore, some media_ids (and therefore some albums and artists) are recommended quite often across all users.
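The quoted coefficient can be reproduced directly (a sketch; it treats the IDs as plain numbers, which is exactly what makes the correlation an artifact of how the IDs were assigned):

```r
# Pearson correlation between the numeric song and album identifiers
cor(Deezer$media_id, Deezer$album_id)  # approx. 0.999
```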
par(mfrow=c(1,3))
hist(Deezer$media_id)
hist(Deezer$album_id)
hist(Deezer$artist_id)
Likewise, 2922 different genres exist in the dataset. The counts show clearly that one specific genre with id = 0 occurs about 3.7 million times, while the second most frequent occurs only around 0.9 million times. This would mean that one specific genre is recommended the most. Further investigation is done in the feature engineering chapter.
sum(is.na(Deezer$genre_id))
## [1] 0
number_genre <- Deezer %>%
group_by(genre_id) %>%
summarize(n=n()) %>%
arrange(-n)
head(number_genre)
## # A tibble: 6 × 2
## genre_id n
## <int> <int>
## 1 0 3666789
## 2 7 929538
## 3 10 288272
## 4 25 269032
## 5 27 187946
## 6 14 177117
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],25))
ggplot(number_genre, aes(genre_id)) +
geom_histogram(bins=50,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,750),breaks=seq(0,750,100))+
labs(y="number of observations",
x="genre_id",
title="Distribution of Genre_id",
caption="Source: Deezer train data")+
theme_light()
We were also provided with some information about the media duration, but a quick look at the summary reveals a couple of outliers (the maximum of 65535 seconds, i.e. 2^16 - 1, looks like a capped or erroneous value).
summary(Deezer$media_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 196.0 222.0 231.2 254.0 65535.0
For the visualisation, we filtered out the lowest and highest 5% of the media durations, keeping the values between the 5th and 95th percentile.
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(media_duration>=quantile(media_duration,0.05) & media_duration<=quantile(media_duration,0.95)), aes(media_duration)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Media Duration in Seconds",
title="Distribution of Mediaduration",
subtitle="For 90% of the Data",
caption="Source: Deezer train data")+
theme_light()
The column context_type reflects the context in which the song was listened to, for example whether it was played from a playlist or from an album.
sum(is.na(Deezer$context_type))
## [1] 0
#hist(Deezer$context_type)
head(table(Deezer$context_type))
##
## 0 1 2 3 4 5
## 3198365 1617653 1052844 433456 230259 167428
The release date of a song has the format YYYYMMDD. By extracting the year from this value, the most common release years can be analyzed. As the plot shows, songs from recent years occur most often in our dataset.
sum(is.na(Deezer$release_date))
## [1] 0
releaseDate = as.Date(as.character(Deezer$release_date), format="%Y%m%d")
splitrdt <- data.frame(ryear = as.numeric(format(releaseDate, format = "%Y")))
Deezer = cbind(Deezer, splitrdt)
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],15))
ggplot(Deezer%>%filter(ryear>=quantile(ryear,0.05) & ryear<=quantile(ryear,0.95)), aes(ryear)) +
geom_histogram(bins=30,fill=myColors, color="black")+
labs(y="Number of Observations",
x="Release Year of a song",
title="Distribution of Release Year",
subtitle="for 90% of the data",
caption="Source: Deezer train data")+
theme_light()
The dataset also contains information about the device and the operating system the user used to listen to the music. Both features have three values (1 to 3). Although device “1” and operating system “1” appear most of the time, we cannot say which specific device or system they represent. Furthermore, these two columns are correlated (r = 0.5).
sum(is.na(Deezer$platform_family))
## [1] 0
sum(is.na(Deezer$platform_name))
## [1] 0
par(mfrow=c(1,2))
myColors <- brewer.pal(7,"Blues")[5:7]
ggplot(Deezer,aes(x=platform_family))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Devise",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
ggplot(Deezer,aes(x=platform_name))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Observations",
x="Used Operation System",
title="Distribution of Platform",
caption="Source: Deezer train data")+
theme_light()
The ts_listen feature contains the time at which a specific song was recommended to a user. It is a UNIX timestamp and will be transformed later on; the third chapter about feature engineering covers some visualizations of it. No NAs were detected.
sum(is.na(Deezer$ts_listen))
## [1] 0
The listen_type feature indicates whether the song was listened to in "Flow" or not. This is important because the training data includes the whole listening history of a user, not just the behavior on recommendations. Songs that were not recommended by "Flow" could be songs the user searched for or saved in specific playlists. Thereby, 5.24 million rows do not represent recommended songs, while 2.32 million were suggested by the recommendation engine "Flow". The test data contains only songs suggested by "Flow".
myColors <- brewer.pal(7,"Blues")[5:6]
ggplot(Deezer,aes(x=listen_type))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="number of observations",
x="listening type",
title="Distribution of Listening Type",
caption="Source: Deezer train data")+
theme_light()
table(Deezer_test$listen_type)
##
## 0 1
## 1 19917
sum(is.na(Deezer$listen_type))
## [1] 0
Interestingly, we detected one single row inside the test set with listen_type==0.
Finally, the most important column, is_listened, represents the response variable for the modeling part.
myColors <- rep(brewer.pal(7,"Blues")[5:6],2)
ggplot(Deezer,aes(x=is_listened))+geom_bar(fill=myColors,color="black")+
labs(subtitle="in the train set",
y="Number of Observations",
x="Listening Type",
title="Distribution of Is Listened versus Listening Type",
caption="Source: Deezer train data")+
theme_light()+facet_wrap(~listen_type)
Here, we can see that songs which were not suggested (e.g. searched for or listened to in a playlist) have a higher chance of being listened to (0.721113) than songs suggested by "Flow" (0.6002817).
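These rates can be reproduced with a simple grouped mean; a hedged sketch with a toy data frame in place of the full train set:

```r
library(dplyr)

# Toy rows standing in for the Deezer train data; on the real data the
# grouped means come out as 0.721113 (listen_type 0) and 0.6002817 (listen_type 1).
toy <- data.frame(
  listen_type = c(0, 0, 0, 1, 1, 1, 1),
  is_listened = c(1, 1, 0, 1, 0, 1, 0)
)
toy %>%
  group_by(listen_type) %>%
  summarise(listen_rate = mean(is_listened))
```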
At first glance, the dataset looks like tall data. However, most of the features are factors. As already mentioned in the previous chapter, the data contains 19918 different users, 2922 genres, 67142 artists, 151471 albums and 452975 unique songs.
Therefore, the main challenge of this project was to create a prediction model with this number of factor levels. Furthermore, the insufficient data in the genre_id column, the very granular timestamp and the release date were features that needed improvement.
Furthermore, the following correlation plot visualizes the collinearity between the variables. It can be easily identified that songs and release date as well as the used operating system and device are highly correlated. For all other variables just a slight dependency exists.
corDeezer <- cor(Deezer)
#png("correlationMatrix.png")
corDeezerplot <- corrplot(corDeezer, method = "ellipse", type = "full")
#dev.off()
In this part, we will describe our feature engineering. We start with simple transformations and calculations and move on to more advanced approaches throughout this chapter. We try to demonstrate everything we achieved along our journey to the point of greatest success.
Converting the timestamp is the most basic feature engineering we did. We extract the hour of day as well as the weekday from the timestamp. This procedure is simple enough that we did it on the fly, as you will see at the end of the feature engineering section. Nevertheless, the process is described in the following.
The idea behind this is the assumption that the hour of day and/or the day of week contribute patterns to the prediction, e.g. users who regularly go to the gym on Tuesdays and Thursdays at 8pm, or users who tend to listen to slower songs in the evening and faster songs on the way to work. These patterns cannot be found with a time precision of seconds.
transforming_timestamp <- function(x){
timestamp = as.POSIXct(x$ts_listen, origin="1970-01-01")
splitdt <- data.frame(
hh = as.numeric(format(timestamp, format = "%H")), #24hours format
wd = as.numeric(format(timestamp, format = "%w"))) #weekday
x = cbind(x, splitdt) #cbind the extracted time to data
return(x)
}
transforming_timestamp(Deezer)[1:10,12:17]
## user_id artist_id user_age is_listened ryear hh
## 1 9241 55164 29 0 2004 14
## 2 16547 55830 30 1 2006 23
## 3 7665 2704 29 1 2014 14
## 4 1580 938 30 0 2000 10
## 5 1812 2939 24 1 2008 19
## 6 1812 2939 24 1 2008 22
## 7 1812 2939 24 1 2008 10
## 8 10325 2939 29 1 2008 14
## 9 1812 2939 24 1 2008 14
## 10 51 2939 28 1 2008 13
tmp <- transforming_timestamp(Deezer)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,600000),breaks=seq(0,600000,50000))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_train.png",plot=gg9, width = 8, height=5, units="in")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1300000),breaks=seq(0,1300000,100000))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
caption="Source: Deezer train data")+
theme_light()
#ggsave("wd_distribution_train.png",plot=gg10, width = 8, height=5, units="in")
tmp <- transforming_timestamp(Deezer_test)
myColors <- rep(brewer.pal(7,"Blues")[5:6],12)
ggplot(tmp, aes(x=hh))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,1800),breaks=seq(0,1800,100))+
scale_x_continuous(breaks=seq(0,23,1))+
labs(y="observations",
x="hour of day",
title="Distribution of Listening Hour",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#ggsave("hh_distribution_test.png",plot=gg11, width = 8, height=5, units="in")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],3),"#4292C6")
ggplot(tmp, aes(x=wd))+stat_bin(binwidth=1,fill=myColors, color="black")+
scale_y_continuous(limits=c(0,5000),breaks=seq(0,5000,250))+
scale_x_continuous(breaks=seq(0,6,1))+
labs(y="observations",
x="weekday",
title="Distribution of Listening Weekday",
subtitle="for test data",
caption="Source: Deezer test data")+
theme_light()
#ggsave("wd_distribution_test.png",plot=gg12, width = 8, height=5, units="in")
Beats per minute (bpm) is a simple numeric feature which we could query from the Deezer API. We thought it might help us to differentiate within genres, for example "soft rock" tends to be slower (lower bpm) than its parent genre "rock". Our intention was not to create subgenres but simply to feed this extra information into the model.
For querying the API we used the httr package for the requests and jsonlite to parse the responses. To show how the code works, we limit the number of queries to 50 instead of the nearly 450,000 media_ids.
## Load necessary packages
library(jsonlite)
library(httr)
uniquetracks <- as.data.frame(unique(all$media_id)[1:50]) #getting all unique media_ids
uniquetracks$api <- paste("https://api.deezer.com/track/",uniquetracks[,1], sep="")
uniquetracks$bpm <- 0
for (i in 1:length(uniquetracks[,1])) { ## for all tracks
this.raw.result <- GET(url = as.character(uniquetracks[i,2])) ## get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) ## turn it into a readable format
uniquetracks$bpm[i] <- ifelse(is.null(this.result$bpm),"NA",this.result$bpm) ## get the BPM
#message(as.character(i), appendLF = FALSE) ## print the iteration to see if the code is still working
Sys.sleep(time = 0.05) ## cap the speed so the 50 per 5 seconds are not violated
}
head(uniquetracks,10)
## unique(all$media_id)[1:50] api bpm
## 1 222606 https://api.deezer.com/track/222606 100.1
## 2 250467 https://api.deezer.com/track/250467 160.2
## 3 305197 https://api.deezer.com/track/305197 90.1
## 4 900502 https://api.deezer.com/track/900502 124.0
## 5 542335 https://api.deezer.com/track/542335 120.0
## 6 542341 https://api.deezer.com/track/542341 123.0
## 7 542346 https://api.deezer.com/track/542346 163.0
## 8 542340 https://api.deezer.com/track/542340 95.0
## 9 542347 https://api.deezer.com/track/542347 117.0
## 10 549742 https://api.deezer.com/track/549742 120.0
#save(tracks, file = "tracks_BPM.rda")
After fetching the information from the API, we need to clean it. Because the bpm is received as a character, we can set all cells where the string is "NA" to 0 without using is.na(). Transforming the column to numeric afterwards leaves no missing values but about 30,000 zeros (only 171 of which were introduced by setting "NA" to 0).
load("/home/Deezer/30_Wrangled_Data/Archiv/tracks_BPM.rda")
tracks <- tracks[,c(2,3)]
tracks$track_bpm[tracks$track_bpm=="NA"] <- "0"
tracks$track_bpm <- as.numeric(tracks$track_bpm)
myColors <- rep(brewer.pal(7,"Blues")[5:6],20)
ggplot(tracks, aes(x=track_bpm))+stat_bin(bins=40, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,46000),breaks=seq(0,45000,5000))+
scale_x_continuous(breaks=seq(0,240,10))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
To fill the zeros, we take the mean bpm of the album the track is on, or of its genre if the album information is not available. Sadly, at this point we did not yet have information about the genre, so to keep the report chronologically correct and not confusing, we will go to the genre section now and come back to this afterwards.
To increase the available information we looked for additional sources, such as other API providers, and found EchoNest and an R interface (https://github.com/mukul13/rechonest). EchoNest provides a lot of information about songs, albums, artists, etc. Unfortunately, EchoNest was bought by Deezer's market competitor Spotify. Furthermore, we found a hint in the DSG forum that we may use the Deezer API but are not allowed to use other sources.
In a brainstorming session we thought about additional factors that influence whether we want to listen to a song or not. One aspect is the timbre of a voice or the gender of an artist, e.g. in combination with a genre. Since we were not allowed to use EchoNest and this information is not available in Deezer's API, we discarded this idea.
Thinking of our parents' music taste, one could claim that older people tend to listen to older songs; one explanation could be that those users tend to listen to songs from their "youth". The question was whether user_age and the year of release_date correlate. Unfortunately, user_age and release_date were not significant in our first models.
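Once the release year is extracted, the correlation question can be phrased in one line; here is a hedged sketch with invented toy values (which, by construction, correlate strongly, unlike the real data, where the effect was weak):

```r
# Toy values standing in for Deezer$user_age and the extracted release year;
# made up for illustration only.
toy <- data.frame(
  user_age = c(18, 22, 25, 29, 30, 45, 50),
  ryear    = c(2016, 2015, 2010, 2008, 2005, 1995, 1990)
)
cor(toy$user_age, toy$ryear)  # strongly negative for these toy values
```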
The next approach was to further explore the "extra info" data to check whether language detection could yield new features. First, we cleaned the text and then used the "textcat" package to detect the language of the song title, album title and artist name. The next step was to encode logical interactions between these variables: for example, if the song language and the album language are exactly the same, we group them, otherwise we treat the pair as a variation. Below is the step-by-step procedure followed to extract these language features.
# Add new features: join json file with media description (categorical variables)
library(jsonlite)
library(httr)
library(qdap)
library(tm)
extra = stream_in(file("/home/Deezer/60_data_other_models/extra_infos.json"))
extra = extra[1:100,]
str(extra)
##################################################
## add language feats (detect lang)
##################################################
library("textcat")
library("rvest")
library("stringr")
#text cleaning (in order to assign unique numeric value)
extra[] = lapply(extra, tolower)
extra[] = lapply(extra, removePunctuation)
extra[] = lapply(extra, stripWhitespace)
extra$songLang = textcat(extra$sng_title)
extra$albLang = textcat(extra$alb_title)
extra$artistLang = textcat(extra$art_name)
#if song lang and alb lang is exact same them grouping them, else considering it as variation
extra$langSngAlb = ifelse(extra$songLang == extra$albLang, extra$albLang, "variation")
extra$langAlbArt = ifelse(extra$albLang == extra$artistLang, extra$artistLang, "variation")
extra$langSngArt = ifelse(extra$songLang == extra$artistLang, extra$artistLang, "variation")
extra$langReg = ifelse(extra$songLang == "french" | extra$songLang == "german" | extra$songLang == "spanish" |
extra$songLang == "swedish" | extra$songLang == "italian" |
extra$songLang == "polish", "TopEuLang", "EnglishOrOther")
str(extra)
extra$media_id = as.numeric(extra$media_id)
# save(extra,file="extra_feats11.rda")
#add to train_test dataset
#train_test = left_join(train_test, extra, by = "media_id")
#str(train_test)
We checked the importance of these language features via the feature importance matrix in various XGBoost (label encoded), LightGBM and H2O models, but their contribution to accuracy was relatively low. The illustration of feature importance is included at the end of the XGBoost NUM and LightGBM model scripts (for LightGBM, scroll down to the bottom for the feature importance matrix).
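As an illustration, the XGBoost importance check can be sketched like this; the toy matrix and its column names are assumptions standing in for our label-encoded feature set, while xgb.importance is the actual function consulted:

```r
library(xgboost)

# Toy feature matrix standing in for the label-encoded train set;
# the column names mimic some of our engineered features.
set.seed(1)
X <- matrix(runif(200), ncol = 4,
            dimnames = list(NULL, c("songLang", "langSngAlb", "hh", "track_bpm")))
y <- rbinom(50, 1, 0.6)
fit <- xgboost(data = X, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)
xgb.importance(model = fit)  # the Gain column ranks feature contributions
```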
In a first approach we tried to get enhanced information about genre_id by querying the API for the genre_ids. As seen in the visualization part, we were confronted with a vast number of observations with genre_id==0, and it would be nice to know which genre that actually is.
uniquegenres <- unique(all$genre_id) #getting all unique genre_ids
uniquegenres <- as.data.frame(paste("https://api.deezer.com/genre/",uniquegenres, sep="")) #paste the ids into the needed API-url to access the informations
uniquegenres$name <- "" #initialise a empty column
uniquegenres$genre_id <- unique(all$genre_id)
At this point we are able to query from the API, using the following loop. To show the results, we set the sample size to 50.
#library(httr)
#library(jsonlite)
for (i in 1:50){ #first 500 genres
this.raw.result <- GET(url = as.character(uniquegenres[i,1])) # get data
this.result <- fromJSON(rawToChar(this.raw.result$content)) # turn data from unix to char and from json into a variable
uniquegenres$name[i] <- ifelse(is.null(this.result$name),"NA",this.result$name) # fill NA where NULL
Sys.sleep(time = 0.1) # System Sleep time to not overload the APIs capacity of 50 requests every 5 seconds
}
head(uniquegenres[,c(2,3)],10)
## name genre_id
## 1 NA 25471
## 2 NA 25571
## 3 Musique asiatique 16
## 4 Seggae mauritien 7
## 5 Afro Pop 3
## 6 NA 2519
## 7 Musique indonésienne 25
## 8 NA 50
## 9 Zouglou 10
## 10 Spiritualité & religion 228
As we can see, the genre names are mainly NAs. If one looked at row number 21, one would see that genre_id==0 refers to "all".
Now that we know that we cannot and should not rely on genre_id, we need to get the genre information from another source. Luckily, we have the album_id, and the API provides genre information for each album individually, which is not related to the genre_id in the first place.
Genre Information from album_id
albums <- unique(all$album_id) #getting all album ids
albums <- paste("https://api.deezer.com/album/",albums, sep="") #get them in the right format for query the API
albums <- as.data.frame(albums)
albums$album_id <- unique(all$album_id)
#initiating some columns
albums$alb.genre <- ""
albums$alb.genre.id <- ""
In the next chunk we receive the genre information for the first 50 albums.
for (i in 1:50){ #for 50 albums. you can replace 50 by length(albums$albums) to loop over all
this.raw.result <- GET(url = as.character(albums[i,1])) #get the infos
this.result <- fromJSON(rawToChar(this.raw.result$content)) #turn it into a readable format
albums$alb.genre[i] <- ifelse(is.null(this.result$genres$data[2]),"NA",this.result$genres$data[2]) #getting the written genre
albums$alb.genre.id[i] <- ifelse(is.null(this.result$genre_id),"NA",this.result$genre_id) #getting the genre_id from the album
#message(as.character(i), appendLF = FALSE) #print the iteration to see if the code is still working
Sys.sleep(time = 0.05) #cap the speed so the 50 per 5 seconds are not violated
}
formattable(head(albums,50))
| albums | album_id | alb.genre | alb.genre.id |
|---|---|---|---|
| https://api.deezer.com/album/41774 | 41774 | Folk | 466 |
| https://api.deezer.com/album/43941 | 43941 | Electro, Dance , Pop , Rock | 106 |
| https://api.deezer.com/album/48078 | 48078 | Klassik , Filme/Videospiele, Filmmusik | 98 |
| https://api.deezer.com/album/71521 | 71521 | Pop | 132 |
| https://api.deezer.com/album/71718 | 71718 | Country , Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock | 84 |
| https://api.deezer.com/album/72309 | 72309 | NULL | -1 |
| https://api.deezer.com/album/72703 | 72703 | R&B | 165 |
| https://api.deezer.com/album/74082 | 74082 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/74870 | 74870 | NA | NA |
| https://api.deezer.com/album/76052 | 76052 | Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/77825 | 77825 | Pop , Filme/Videospiele, Filmmusik , Folk | 132 |
| https://api.deezer.com/album/79401 | 79401 | Dance, Pop , Rock | 113 |
| https://api.deezer.com/album/80386 | 80386 | Pop , R&B , Spirituelles & Religion | 132 |
| https://api.deezer.com/album/80780 | 80780 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/82356 | 82356 | Rap/Hip Hop , R&B , Contemporary Soul | 116 |
| https://api.deezer.com/album/82750 | 82750 | Brasilianische Musik, Pop , International Pop | 75 |
| https://api.deezer.com/album/82947 | 82947 | Pop , Rock , Indie Rock/Rock Pop | 132 |
| https://api.deezer.com/album/83144 | 83144 | Pop , International Pop , Lateinamerikanische Musik | 132 |
| https://api.deezer.com/album/87478 | 87478 | Rap/Hip Hop, Pop , Rock , R&B , Soul & Funk | 116 |
| https://api.deezer.com/album/87675 | 87675 | Alternative | 85 |
| https://api.deezer.com/album/88463 | 88463 | Alternative, Indie Rock , Electro , Dance , Pop , Rock | 85 |
| https://api.deezer.com/album/89448 | 89448 | Klassik | 98 |
| https://api.deezer.com/album/90630 | 90630 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/94964 | 94964 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk , International Pop, Rock | 85 |
| https://api.deezer.com/album/97328 | 97328 | Pop , Reggae, Rock | 132 |
| https://api.deezer.com/album/97722 | 97722 | Pop , Rock | 132 |
| https://api.deezer.com/album/97919 | 97919 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/99298 | 99298 | Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 |
| https://api.deezer.com/album/99495 | 99495 | Dance | 113 |
| https://api.deezer.com/album/99692 | 99692 | Pop | 132 |
| https://api.deezer.com/album/100086 | 100086 | Heavy Metal | 464 |
| https://api.deezer.com/album/102450 | 102450 | Pop | 132 |
| https://api.deezer.com/album/104223 | 104223 | Rap/Hip Hop | 116 |
| https://api.deezer.com/album/104617 | 104617 | Pop , Rock | 132 |
| https://api.deezer.com/album/106390 | 106390 | Electro , Techno/House , Dance , Jazz , Pop , International Pop | 106 |
| https://api.deezer.com/album/109345 | 109345 | Dance | 113 |
| https://api.deezer.com/album/109542 | 109542 | Electro, Dance , Pop , Reggae , Disco | 106 |
| https://api.deezer.com/album/113876 | 113876 | NA | NA |
| https://api.deezer.com/album/116831 | 116831 | Pop , International Pop, Rock | 132 |
| https://api.deezer.com/album/118013 | 118013 | Country , Pop , Rock , Singer & Songwriter | 84 |
| https://api.deezer.com/album/118801 | 118801 | Rock | 152 |
| https://api.deezer.com/album/119589 | 119589 | Jazz | 129 |
| https://api.deezer.com/album/119983 | 119983 | R&B | 165 |
| https://api.deezer.com/album/120574 | 120574 | Pop , Rock | 132 |
| https://api.deezer.com/album/122544 | 122544 | Jazz | 129 |
| https://api.deezer.com/album/123135 | 123135 | Pop | 132 |
| https://api.deezer.com/album/124120 | 124120 | Pop | 132 |
| https://api.deezer.com/album/124711 | 124711 | Pop | 132 |
| https://api.deezer.com/album/124908 | 124908 | Rap/Hip Hop, R&B | 116 |
| https://api.deezer.com/album/125105 | 125105 | Jazz | 129 |
So far we created a new DF called "albums" which has 4 columns: API query, album_id, genre and genre_id, where genre is the written name of the genre and the ID is the corresponding identifier. We can see that some albums have more than one genre and some have NULL.
Looking back at the data, it would be quite easy to extract the lists of genres out of the cells of the DF. But when we started the project, we were not that fond of performing transformations on lists of lists. And since this report demonstrates what we have done, not what we should have done, here is our original approach.
We used gsub to get rid of all symbols.
albums[,'alb.genre2'] <- gsub("c\\(", "" , albums[,'alb.genre']) #remove "c("
albums[,'alb.genre2'] <- gsub("\\)", "" , albums[,'alb.genre2']) #remove ")"
albums[,'alb.genre2'] <- gsub("\"", "" , albums[,'alb.genre2']) #remove quotation marks
albums[,'alb.genre2'] <- gsub(" ", "" , albums[,'alb.genre2']) #remove " "
albums <- splitstackshape::cSplit(albums,splitCols = "alb.genre2",direction = "wide")
formattable(head(albums[,c(2,5,6,7)],10))
| album_id | alb.genre2_1 | alb.genre2_2 | alb.genre2_3 |
|---|---|---|---|
| 41774 | Folk | NA | NA |
| 43941 | Electro | Dance | Pop |
| 48078 | Klassik | Filme/Videospiele | Filmmusik |
| 71521 | Pop | NA | NA |
| 71718 | Country | Alternative | IndieRock |
| 72309 | NULL | NA | NA |
| 72703 | R&B | NA | NA |
| 74082 | Alternative | IndiePop | IndieRock |
| 74870 | NA | NA | NA |
| 76052 | Alternative | IndieRock | Pop |
At this stage we decided to only grab the very first of the listed genres for each album, since it makes most sense that the main genre is named first. We still have NULLs and NAs.
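Had we worked with the list cells directly, grabbing the first listed genre would have been a one-liner; a hedged sketch with toy cells:

```r
# Toy list cells standing in for albums$alb.genre; NULL marks a missing genre.
genre_cells <- list("Folk", c("Electro", "Dance", "Pop"), NULL)
main_genre <- sapply(genre_cells,
                     function(g) if (length(g) == 0) NA_character_ else g[[1]])
main_genre  # "Folk" "Electro" NA
```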
Our first approach to tackling the NULLs and NAs was to get each album's corresponding artist id. Then, with the artist information, we were able to look up the artist's most played genre to fill in missing data. For example, if artist 1 had 4 albums in the genre "rock", one could assume that a 5th album whose genre is missing is "rock" as well.
albums$artist <- all$artist_id[match(albums$album_id, all$album_id)] #joining artist information (match picks one artist_id per album)
#save(albums,file="bums_art.rda")
#load("bums_art.rda")
At this point we need to load data which was saved during earlier runs.
load("/home/Deezer/30_Wrangled_Data/bums_art.rda")
formattable(head(albums,10))
| album_id | alb.genre | alb.genre.id | alb.genre2_01 | alb.genre2_02 | alb.genre2_03 | alb.genre2_04 | alb.genre2_05 | alb.genre2_06 | alb.genre2_07 | alb.genre2_08 | alb.genre2_09 | alb.genre2_10 | alb.genre2_11 | alb.genre2_12 | alb.genre2_13 | alb.genre2_14 | alb.genre2_15 | alb.genre2_16 | artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41774 | Folk | 466 | Folk | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 55164 |
| 43941 | Electro, Dance , Pop , Rock | 106 | Electro | Dance | Pop | Rock | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 55830 |
| 48078 | Klassik , Filme/Videospiele, Filmmusik | 98 | Klassik | Filme/Videospiele | Filmmusik | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2704 |
| 71521 | Pop | 132 | Pop | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 938 |
| 71718 | Country , Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock | 84 | Country | Alternative | IndieRock | Pop | IndiePop/Folk | Rock | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
| 72309 | NULL | -1 | NULL | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
| 72703 | NULL | -1 | NULL | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
| 74082 | Alternative , Indie Pop , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 | Alternative | IndiePop | IndieRock | Pop | IndiePop/Folk | Rock | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
| 74870 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
| 76052 | Alternative , Indie Rock , Pop , Indie Pop/Folk, Rock | 85 | Alternative | IndieRock | Pop | IndiePop/Folk | Rock | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2939 |
Only with this data can you see what splitstackshape::cSplit really did: it cut the genres out of the list of lists and put each of them into a new column. Therefore, using only the very first of these new columns is what we want, since it is the first entry of the genre list. To get going, we created a new DF which contains all the missing data (NAs and NULLs).
Na.albums <- albums %>% filter(alb.genre.id=="NA" | alb.genre.id==-1) #get all NAs and NULLs into a separate DF
Na.albums <- as.data.frame(Na.albums[,c(1,20)])
colnames(Na.albums) <- c("album_id","artist_id")
albums2 <- albums[!albums$album_id %in% Na.albums$album_id,] #get clean data into a new DF
As we can see, we have 20521 rows in Na.albums, which means that 20521 albums out of 151605 do not have a genre after querying the API.
all2 <- all[,c(4,13)] #subset all to contain only album_id and artist_id
all2 <- left_join(all2,albums2[,c(1,4)],by="album_id") #joining our new genre
#creating artists loopuptable
loup.artists <- all2 %>%
group_by(artist_id,alb.genre2_01) %>%
summarise(n=n()) %>% #how often a genre at a artist appears. btw: listened/played songs more important than overall genres from albums released
filter(n==max(n))
Na.albums <- merge(Na.albums,loup.artists[,-3],by="artist_id") #reduced from 20.000 NA albums to 999.
albums2 <- albums2[,c(20,1,4)]
albums2 <- bind_rows(albums2, Na.albums[!is.na(Na.albums$alb.genre2_01),-3]) #adding albums with new information to album2
Na.albums <- Na.albums[is.na(Na.albums$alb.genre2_01),] #999 albums still without genre
This implementation was a big success. We were able to make a valid guess at the genre by using the artist's main genre, reducing the number of NAs and NULLs by more than 95%.
To fill the final 999 genres, we use the bpm collected earlier to identify the genre. But, as mentioned before, we first need what we know about the genre so far to fill the gaps in our bpm data adequately.
First we get all songs where the bpm is 0; then we fill in the mean bpm of the album the song is on, and if that information is not available, we fill in the mean bpm of the genre.
Note: make sure you have run 2(b) already.
#bpm.data <- all[!duplicated(all$media_id),c("media_id","album_id")]
bpm.data <- unique(all[,c("media_id","album_id")]) #get all media
bpm.data<- merge(bpm.data,tracks,by="media_id") #join bpm
bpm.data<- merge(bpm.data,albums2[,c(-1,-4)],by="album_id", all.x=TRUE) #join genre_id we know so far
#creating album and genre lookup tables for the average bpm
bpm.merge.album <- bpm.data %>% filter(track_bpm>0) %>% group_by(album_id) %>% summarise(track_bpm=mean(track_bpm))
bpm.merge.genre <- bpm.data %>% filter(track_bpm>0) %>% group_by(alb.genre2_01) %>% summarise(track_bpm=mean(track_bpm))
#fill in album_mean_bpm where it is known
bpmclean2 <- bpm.data[bpm.data$track_bpm==0,]
bpmclean2 <- merge(bpmclean2[,c(1,2,4)],bpm.merge.album, by="album_id",all.x=TRUE)
bpmclean3 <- bpmclean2[is.na(bpmclean2$track_bpm),] #where we dont have bpm data from the album
bpmclean3 <- merge(bpmclean3[,c(-4)],bpm.merge.genre, by="alb.genre2_01",all.x=TRUE) #join the information from the average_genre_lookup table
What was quite amusing is one very resilient little fellow: among all 7.5 million songs in our data, there is only one "HörbücheraufDeutsch" (German audio book). Since it is the only one of its kind, and it is an album by itself, we do not have any information to impute from, so we set the bpm to 80, which is on the lower end of the scale.
bpmclean3[is.na(bpmclean3$track_bpm),]
## alb.genre2_01 album_id media_id track_bpm
## 5899 HörbücheraufDeutsch 13903144 130948010 NA
bpmclean3[is.na(bpmclean3$track_bpm),4] <- 80
Now we only had to clean everything and bind it back into a single bpm lookup table, which we will call tracks_cleaned.rda.
bpm1 <- bpm.data %>% filter(track_bpm>0)
bpm2 <- bpmclean2 %>% filter(!is.na(track_bpm))
bpm2 <- bpm2[,c(1,2,4,3)]
bpm3 <- bpmclean3[,c(2,3,4,1)]
tracks_clean <- rbind(bpm1,bpm2,bpm3)
tracks_clean <- tracks_clean[!duplicated(tracks_clean$media_id),]
tracks_clean2 <- tracks_clean[,c(2,3)]
#save(tracks_clean, file="data_for_genre_loop.rda")
#save(tracks_clean2,file="tracks_cleaned.rda")
load("/home/Deezer/30_Wrangled_Data/tracks_cleaned.rda")
myColors <- rep(brewer.pal(7,"Blues")[5:6],10)
ggplot(tracks_clean2, aes(x=track_bpm))+stat_bin(bins=20, fill=myColors, color="black")+
scale_y_continuous(limits=c(0,80000),breaks=seq(0,80000,5000))+
scale_x_continuous(breaks=seq(50,220,10),limits=c(50,220))+
labs(title="Distribution of Beats per Minute",
y="observations",
x="beats per minute",
caption="Source: Deezer train data")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("BPM_distribution.png",plot=gg8, width = 8, height=5, units="in")
To fill the missing genres for our 999 remaining albums, we fit a glm model to predict the genre from the song's bpm, since we had already exhausted all other possibilities. Following a majority vote on how the genre is distributed over the known cases, we could alternatively assume that the missing ones are "pop".
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
library(treemap)
#png(filename="genre_treemap.png",width=16, height=9, units = "in",res=72)
treemap(DeezerNew_train_0525 %>% group_by(alb.genre2_01) %>% summarise(n=log(n())), index="alb.genre2_01",vSize = "n",title = "Logarithmic Distribution of Genres", palette = "GnBu")
#dev.off()
Note: this code runs for some minutes (glm for 47 models). For the sake of creating our html report, we set the chunks to eval=FALSE to prevent them from running. You can skip the following two chunks without missing too much information.
test.loop <- tracks_clean %>% filter(is.na(alb.genre2_01)) %>% select(album_id,track_bpm)
results <- test.loop
#loop and fit
for (i in levels(tracks_clean$alb.genre2_01)){
tracks_clean$y <- as.factor(as.numeric(tracks_clean$alb.genre2_01==i))
train.loop <- tracks_clean %>% filter(!is.na(alb.genre2_01)) %>% select(track_bpm,y)
fit <- glm(y~track_bpm,data=train.loop, family=binomial(link = "logit"))
results[,i] <- predict(fit,test.loop, type="response")
}
results <- results[,-2]
At this point we have made a prediction for every missing album and can show that all of them are considered to be "pop".
albums.finish <- melt(results,id.vars="album_id")
albums.finish <- albums.finish %>% group_by(album_id,variable) %>% summarise(value=sum(value))
albums.finish <- albums.finish %>% group_by(album_id) %>% filter(value==max(value))
table(albums.finish$variable)
The table shows that for all the songs from our 999 missing albums, the looped glm model suggested they should be considered "pop", which also follows the majority vote.
Na.albums$alb.genre2_01 <- "Pop"
albums2 <- bind_rows(albums2[,c(2,3)], Na.albums[,c(2,3)]) #adding albums with new information to album2
albums2 <- albums2[!duplicated(albums2$album_id),]
#save(albums2,file="album_genre_clean.rda")
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda") #we load a newer version of the data at this stage, but we only build on what has been worked out so far.
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
One of our first thoughts about this project was: "the genre is the most important indicator of whether you like a song or not." At this point we already had trustworthy genre information, so we could continue feature engineering. In a first attempt we generated a clustering based on each user's genre listening history. Sadly, we cannot provide the original code, because the Zeno server crashed after running it and the rmd file, unsaved at that point, was lost. Luckily, we had saved the output of the code in an rda file.
Here is an attempt to recreate the code to give you an impression:
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #dcast is the inverse of melt, creating a new column for each value pair
profs[is.na(profs)] <- 0
profs[1:10,c(1,3,13,17,29,41)]
## user_id Alternative Dance Electro Jazz Rock
## 1 0 0.117741935 0.12338710 0.247741935 0.066290323 0.015483871
## 2 1 0.178102190 0.07023520 0.096025953 0.089213301 0.026277372
## 3 10 0.014770241 0.13676149 0.054704595 0.000547046 0.003829322
## 4 100 0.078269825 0.08444902 0.024716787 0.013388260 0.019567456
## 5 1000 0.201660735 0.10083037 0.186239620 0.000000000 0.059311981
## 6 10000 0.012658228 0.01265823 0.012658228 0.012658228 0.025316456
## 7 10001 0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
## 8 10002 0.097297297 0.01621622 0.021621622 0.037837838 0.075675676
## 9 10003 0.000000000 0.03623188 0.007246377 0.007246377 0.000000000
## 10 10004 0.009174312 0.00000000 0.000000000 0.000000000 0.009174312
This chunk lets us see how genres are distributed for each individual user, e.g. the first user listened 11% "Alternative", 12% "Dance", 24% "Electro" and so on.
We used this information to create clusters of similar users, choosing hclust over kmeans because we did not know how many clusters we wanted to end up with and did not want to rerun the code multiple times.
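A minimal sketch (on random toy data) of the property we relied on: the hierarchical tree is computed once, and cutree() can then be evaluated for any number of clusters without refitting, whereas kmeans would have to be rerun for each choice of k.

```r
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)          # 20 toy "users" in 2 dimensions
tree <- hclust(dist(toy), method = "complete")
k5  <- cutree(tree, k = 5)                  # 5 clusters from the same tree
k10 <- cutree(tree, k = 10)                 # 10 clusters, no refit needed
length(unique(k5)); length(unique(k10))     # 5 and 10
```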
The code looked somewhat like this. Note that creating the distance matrix takes around 2 minutes on the Zeno server and around 5-10 minutes on a regular machine, since it consists of roughly 198 million elements.
d <- dist(profs[,-1])
We want to use this journey back to our beginnings to show how the different clustering methods look. In the end, we used "complete" linkage (meaning the maximum distance between members is used to link two clusters) and cut the tree at 35 clusters.
c.tree.comp = hclust(d,method="complete")
w.tree.comp = hclust(d,method="ward.D2")
s.tree.comp = hclust(d,method="single")
par(mfrow=c(1,3))
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)
labs.comp = cutree(c.tree.comp,k=35)
#lookup <-as.data.frame(profs$user_id)
#lookup$profile_id <- labs.comp
#colnames(lookup) <- c("user_id","profile_id")
Adding this new feature to our previously best model increased our score by 4%.
We did not stop here, since some aspects were still worrying: we take into account neither the number of observations nor how often a genre was played, but only how often a user listened to a genre. We kept developing the idea, which is worth a chapter of its own.
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
DeezerNew_train_0525$is_listened <- as.numeric(DeezerNew_train_0525$is_listened)-1
library(dplyr)
library(splitstackshape)
library(reshape2)
Sometimes a simple idea can lead to a cascade of thoughts and breakthroughs, as happened in our case. The thought was a simple one: "When we look at the distribution of genres for each unique user and cluster them, don't we actually ignore the preferences of the individual user?" What happens, for example, to a user with multiple favorite genres? Imagine a user who listens to only 5 genres in an equally distributed fashion - each of those genres has a score of 0.2 in our previous approach. Now imagine another user who listens to two genres, one 80% and one 20% of the time. At the moment we treat the genres with a score of 0.2 as equally important, even though this is clearly not the case: for user 1, every genre is a favorite, while for user 2 the second genre is not.
How do we get around this problem? By scaling our previous results per user. That means we take a user's favorite genre (the max() of their scores), set it to 1, and scale all other genres accordingly. With this transformation we can compare different users. Going back to the example above: for user 1, each of the 5 genres now scores 1, while for user 2 only the first genre is 1 and the second is scaled down accordingly (to 0.25).
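To make the scaling concrete, here is a toy sketch with the two hypothetical users from the example (each vector holds relative genre frequencies, including an unplayed genre at 0):

```r
# user 1: five equally played genres plus one unplayed genre
u1 <- c(0.2, 0.2, 0.2, 0.2, 0.2, 0)
# user 2: one dominant genre (80%), one minor genre (20%), rest unplayed
u2 <- c(0.8, 0.2, 0, 0, 0, 0)
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))  # same transform as in the chunk below
scale01(u1)  # 1 1 1 1 1 0 -- every played genre is a favorite
scale01(u2)  # 1.00 0.25 0 0 0 0 -- only the first genre is a favorite
```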
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and each genre
summarise(c = sum(is_listened)) %>% #is_listened is 0/1 coded as.numeric, so sum() works fine
mutate(p = c/sum(c)) #summarise ungroups once, so at this point we are at group_by(user_id), therefore sum(c)==sum(c | user(i))
profs <- profs[,-3] %>%
dcast(user_id ~ alb.genre2_01, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x))))) #using min(x) instead of 0 for convenience, since for all observations min(x)==0
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything
#creating a lookuptable for final models, better not run it again, it takes ages.
#genre_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_scaled")
#save(genre_scaled.molten,file="30_05_genre_scaled_lookup.rda")
tmp <- melt(profs)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,0.65,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="relative frequency", title="Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("relative_genre.png",plot=gg1, width = 8, height=5, units="in")
tmp <- melt(scaled)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="scaled relative frequency",title="Scaled Relative Genre Frequency",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("scaled_relative_genre.png",plot=gg2, width = 8, height=5, units="in")
As mentioned, an idea can lead to a cascade of ideas. Here is our second thought: "What if the Deezer Flow algorithm suggested a genre very often, but it was rarely listened to - yet still often enough to appear as a favorite genre for this user?" To get a sense of this possibility, we created a new table with, for every user, the percentage of plays of each genre that were actually listened to.
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(c = sum(is_listened),n=n(),p=c/n) #sum(is_listened) again, but this time n() is equal to count() and p as the quotient of c/n
profs <- profs[,-c(3,4)] %>%
dcast(user_id ~ alb.genre2_01, value.var="p") #long to wide format
profs[is.na(profs)] <- 0 #NAs are equal to 0
listened <- profs[,-1]
listened$user_id <- profs$user_id
#genre_listened.molten <- melt(listened, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_listened")
#save(genre_listened.molten,file="30_05_genre_listened_lookup.rda")
tmp <- melt(listened)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378 | user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+
scale_y_continuous(breaks=seq(0,1,0.05),limits = c(0,1))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="average is_listened",title="Average is_listened per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("average_is_listened.png",plot=gg3, width = 8, height=5, units="in")
Combining the scaled genre preferences with the listening history is exactly what we had in mind. As one of the values decreases, the overall trustworthiness of the data goes down. As long as both values are high, or both are extremely low, we can be fairly sure that a user will listen or not listen to the song, respectively.
One aspect we have not considered yet is the sample size. Up to this point we cannot differentiate between a user who has listened to only one song of a genre once (scaled value and listening rate both 1) and someone who has listened to a genre hundreds of times.
To compensate for this, we extracted the counts as well:
profs <- DeezerNew_train_0525 %>%
group_by(user_id,alb.genre2_01) %>% #for each user and genre
summarise(n=n()) %>% #count
dcast(user_id ~ alb.genre2_01, value.var="n") #long to wide
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id
#genre_counts.molten <- melt(counts, id.vars="user_id",variable.name="alb.genre2_01",value.name="genre_counts")
#save(genre_counts.molten,file="30_05_genre_counts_lookup.rda")
tmp <- melt(counts)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,90,5),limits = c(0,90))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="Count",title="Count per Genre",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("count.png",plot=gg4, width = 8, height=5, units="in")
We test whether the columns and user_ids are arranged identically in the three tables.
sum(names(listened) != names(scaled)) #should be 0
## [1] 0
sum(names(listened) != names(counts))
## [1] 0
sum(scaled$user_id!=listened$user_id)
## [1] 0
sum(scaled$user_id!=counts$user_id)
## [1] 0
In our last step we combined the three values. To convey our thought process, we show how the combination changed over several rounds of testing and iterative feedback.
To demonstrate what drove the changes, we create a small set of example data:
g <- c(1,.8,.6,.3) #Genre-Distribution
l <- c(.25,.7,.8,.95) #Listening-History
c <- c(200,68,45,19) #Count
Our first approach was g*l*sqrt(c), using c as a weight so that the index grows when more data is available. We never intended to keep the value inside a 0-1 boundary.
g*l*sqrt(c)
## [1] 3.535534 4.617878 3.219938 1.242286
As we can see, there is not enough emphasis on the last case. The last genre has a 95% listening rate over 19 observations; if this genre were played again, we could predict with some confidence that the user is going to listen to it.
Additionally, case 1 still has the second highest value, which would indicate a high likelihood of "is_listened" for a future song of this genre. But its listening rate is only 0.25, which suggests it is rather unlikely that the user will listen to this genre next time.
In a first step, we exchanged the sqrt() for a log(x+1) function, which grows more slowly relative to the count; the +1 inside the log is needed because log(0) would otherwise yield -Inf.
par(mfrow=c(1,2))
plot(log(1:1000))
plot(sqrt(1:1000))
g*l*log(c+1)
## [1] 1.3258262 2.3710996 1.8377479 0.8537837
We can see that the values are closer together now; still, the log is a strong enough penalty for low-count cases. As intended, case 1 has a lower value than before, since its high n no longer carries as much weight.
Since we wanted more emphasis on the listening history, we thought about assigning a weight manually, something like:
weight =5
g*(l*weight)*log(c+1)
## [1] 6.629131 11.855498 9.188739 4.268918
But tweaking an imaginary value is not what we wanted to do. A more solid way of putting more emphasis on the listening behavior was to dampen the influence of g by using sqrt(g). The square root works well here because g ranges between 0 and 1, so sqrt(max(g))==1 and sqrt(min(g))==0, while every value of g below 1 is pulled up towards 1; this flattens the differences in g and thereby puts relatively more weight on l. Additionally, the effect is strongest for low values of g, which is a beneficial side effect.
plot(sqrt(seq(1,0,length.out = 100)))
sqrt(g)*l*log(c+1)
## [1] 1.325826 2.650970 2.372522 1.558789
In this final version we can see that case 4 now outranks case 1, yet, due to its small sample size, not excessively so.
Here is the final code, which computes the behavior_index:
behavior <- sqrt(scaled[,-47])*listened[,-47]*log(counts[,-47]+1)
behavior$user_id <- scaled$user_id
#behavior_molten <- melt(behavior, id.vars="user_id",variable.name="alb.genre2_01",value.name="behavior")
#save(behavior_molten,file="25_05_new_behavior_lookup.rda")
tmp <- melt(behavior)
ggplot(tmp %>% filter(user_id==15980 | user_id==19378| user_id==10971), aes(x=variable,y=value, fill=user_id))+
geom_bar(stat="identity", position="dodge")+scale_y_continuous(breaks=seq(0,4,0.25),limits = c(0,4))+
scale_fill_brewer(palette="Set1")+
labs(x="genre",y="behavior index",title="Behavior Index",subtitle="for three selected users")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("behavior_index.png",plot=gg5, width = 8, height=5, units="in")
tmp <- cbind.data.frame(sqrt(seq(0,1,length.out = 100)),seq(0,1,length.out = 100))
colnames(tmp) <- c("after","before")
ggplot(tmp,aes(x=before,y=after))+geom_point()+
scale_y_continuous(breaks=seq(0,1,0.1),limits = c(0,1))+
scale_x_continuous(breaks=seq(0,1,0.1),limits = c(0,1))+
scale_fill_brewer(palette="Set1")+
labs(x="x",y="sqrt(x)",title="Sqrt(x) versus x")+
theme_light()+
theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("sqrt.png",plot=gg6, width = 8, height=5, units="in")
As you can imagine, the behavior matrix is sparsely filled: for unlistened genres g is 0 even if c is not, so the index is 0. A higher behavior_index for a genre indicates a higher likelihood that the user will listen to it again, while values close to 0 indicate a low likelihood. We used the data both for clustering and as a feature in itself. To use it as a feature, we melt() the matrix and save it as a lookup table, which we later left join onto the data using user_id and genre_id as the key.
formattable(behavior[1:10,1:10])
| AfrikanischeMusik | Alternative | AsiatischeMusik | Blues | Bolero | BrasilianischeMusik | Comedy | ContemporaryR&B | ContemporarySoul | Corridos |
|---|---|---|---|---|---|---|---|---|---|
| 0.1919238 | 4.42734929 | 0.00000000 | 0.01754381 | 0 | 0.1206416 | 0 | 0.08147095 | 0.41231820 | 0 |
| 0.0000000 | 4.82799863 | 0.01150143 | 0.00000000 | 0 | 0.3301773 | 0 | 0.06739716 | 0.05027516 | 0 |
| 0.2037738 | 0.26026544 | 0.00000000 | 0.00000000 | 0 | 0.1207933 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.0000000 | 0.78374923 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.3581292 | 2.67147990 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.3856975 | 0.05425414 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.0000000 | 0.00000000 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.0000000 | 1.21911356 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 2.1302331 | 0.00000000 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
| 0.0000000 | 0.03583519 | 0.00000000 | 0.00000000 | 0 | 0.0000000 | 0 | 0.00000000 | 0.00000000 | 0 |
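The lookup-table construction just described (melt the wide matrix, then left join) can be sketched on toy data; the values and genre names here are made up:

```r
library(reshape2)
library(dplyr)
# toy wide behavior matrix: one row per user, one column per genre
behavior_toy <- data.frame(user_id = c(1, 2),
                           Pop  = c(2.1, 0.0),
                           Rock = c(0.0, 1.3))
lookup <- melt(behavior_toy, id.vars = "user_id",
               variable.name = "alb.genre2_01", value.name = "behavior")
lookup$alb.genre2_01 <- as.character(lookup$alb.genre2_01)  # avoid factor/character join warning
train_toy <- data.frame(user_id = c(1, 2, 2),
                        alb.genre2_01 = c("Pop", "Rock", "Pop"),
                        stringsAsFactors = FALSE)
joined <- left_join(train_toy, lookup, by = c("user_id", "alb.genre2_01"))
joined  # each row now carries the user's behavior index for that genre
```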
Finally, we can replace the previous clustering by clustering the users' behavior_index. If you have not read 2(d), you should go back and read the explanations of our clustering approaches and code.
Once more we show how the different clustering methods look. In the end, we used "complete" linkage and cut the tree at 35 clusters.
d <- dist(behavior[,-47])
c.tree.comp = hclust(d,method="complete")
w.tree.comp = hclust(d,method="ward.D2")
s.tree.comp = hclust(d,method="single")
#png(filename="clustering_types.png",width=16, height=9, units = "in",res=72)
par(mfrow=c(1,3))
plot(c.tree.comp)
plot(w.tree.comp)
plot(s.tree.comp)
#dev.off()
#png(filename="clustering_complete.png",width=16, height=9, units = "in",res=72)
plot(c.tree.comp)
#dev.off()
tree.comp = hclust(d,method="complete")
labs.comp = cutree(tree.comp,k=35)
cluster_user_lookup <-as.data.frame(behavior$user_id)
cluster_user_lookup$profile_id <- labs.comp
colnames(cluster_user_lookup) <- c("user_id","profile_id")
# save(cluster_user_lookup, file="23_05_SQRT_35_Clusters.rda")
load("/home/Deezer/30_Wrangled_Data/23_05_SQRT_35_Clusters.rda")
myColors <- c(rep(brewer.pal(7,"Blues")[5:6],17),"#4292C6")
ggplot(cluster_user_lookup, aes(x=profile_id))+stat_bin(binwidth=1,fill=myColors, color="black")+scale_y_log10()+
scale_x_continuous(breaks=seq(1,35,1))+
labs(y="log10 of observations",
x="profile cluster",
title="Logarithmic Distribution of Profile Belonging",
caption="Source: Feature Engineering Behavior_index Clustering")+
theme_light()+theme(axis.text.x = element_text(angle = 75, hjust = 1))
#ggsave("distribution_profiles.png",plot=gg13, width = 8, height=5, units="in")
The profiles were saved in a lookup table and joined onto the data. Overall, adding the behavior_index as well as the behavior profile increased our accuracy by a significant amount. As a bonus, we could also use the data for g, l and c as additional features.
A day before the deadline, after a meeting with Prof. Löcher and following his suggestion, we used the existing behavior_index architecture to create new features, not just for genre but also for artist and album. In this report we set the following code to eval=FALSE, because it is very time-consuming. For example: the lookup table for user/artist behavior has roughly 1.3e+09 rows, the one for albums 2.97e+09 rows.
profs <- DeezerNew_train_0525 %>%
group_by(user_id,artist_id) %>%
summarise(c = sum(is_listened)) %>%
mutate(p = c/sum(c))
profs <- profs[,-3] %>%
dcast(user_id ~ artist_id, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x)))))
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything
artist_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="artist_id",value.name="artist_scaled")
#save(artist_scaled.molten,file="30_05_artist_scaled_lookup.rda")
profs <- DeezerNew_train_0525 %>%
group_by(user_id,artist_id) %>%
summarise(c = sum(is_listened),n=n(),p=c/n)
profs <- profs[,-c(3,4)] %>%
dcast(user_id ~ artist_id, value.var="p")
profs[is.na(profs)] <- 0
listened <- profs[,-1]
listened$user_id <- profs$user_id
artist_listened.molten <- melt(listened, id.vars="user_id",variable.name="artist_id",value.name="artist_listened")
#save(artist_listened.molten,file="30_05_artist_listened_lookup.rda")
profs <- DeezerNew_train_0525 %>%
group_by(user_id,artist_id) %>%
summarise(n=n()) %>%
dcast(user_id ~ artist_id, value.var="n")
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id
artist_counts.molten <- melt(counts, id.vars="user_id",variable.name="artist_id",value.name="artist_counts")
#save(artist_counts.molten,file="30_05_artist_counts_lookup.rda")
behavior_artist <- sqrt(scaled[,-67143])*listened[,-67143]*log(counts[,-67143]+1)
behavior_artist$user_id <- scaled$user_id
behavior_artist_molten <- melt(behavior_artist, id.vars="user_id",variable.name="artist_id",value.name="behavior_artist")
#save(behavior_artist_molten,file="30_05_behavior_artist_lookup.rda")
Here is the same code for album_id:
profs <- DeezerNew_train_0525 %>%
group_by(user_id,album_id) %>%
summarise(c = sum(is_listened)) %>%
mutate(p = c/sum(c))
profs <- profs[,-3] %>%
dcast(user_id ~ album_id, value.var="p")
profs[is.na(profs)] <- 0
scaled <- as.data.frame(t(apply(profs[,-1], 1, function(x)(x-min(x))/(max(x)-min(x)))))
scaled$user_id <- profs$user_id
scaled[is.na(scaled)] <-0 #for users who have never listened to anything
album_scaled.molten <- melt(scaled, id.vars="user_id",variable.name="album_id",value.name="album_scaled")
#save(album_scaled.molten,file="30_05_album_scaled_lookup.rda")
profs <- DeezerNew_train_0525 %>%
group_by(user_id,album_id) %>%
summarise(c = sum(is_listened),n=n(),p=c/n)
profs <- profs[,-c(3,4)] %>%
dcast(user_id ~ album_id, value.var="p")
profs[is.na(profs)] <- 0
listened <- profs[,-1]
listened$user_id <- profs$user_id
album_listened.molten <- melt(listened, id.vars="user_id",variable.name="album_id",value.name="album_listened")
#save(album_listened.molten,file="30_05_album_listened_lookup.rda")
profs <- DeezerNew_train_0525 %>%
group_by(user_id,album_id) %>%
summarise(n=n()) %>%
dcast(user_id ~ album_id, value.var="n")
profs[is.na(profs)] <- 0
counts <- profs[,-1]
counts$user_id <- profs$user_id
album_counts.molten <- melt(counts, id.vars="user_id",variable.name="album_id",value.name="album_counts")
#save(album_counts.molten,file="30_05_album_counts_lookup.rda")
behavior_album <- sqrt(scaled[,-ncol(scaled)])*listened[,-ncol(listened)]*log(counts[,-ncol(counts)]+1)
behavior_album$user_id <- scaled$user_id
behavior_album_molten <- melt(behavior_album, id.vars="user_id",variable.name="album_id",value.name="behavior_album")
#save(behavior_album_molten,file="30_05_behavior_album_lookup.rda")
Sadly, the computation did not finish fast enough for us to include the new features in a final submission. But we ran some hyperparameter-tuned models after the competition and will present the results later on.
As the last step before modeling, we need to merge all new features, via their lookup tables, into the old dataset. As some of the merges are quite time-intensive (even on the Zeno server), we execute the code only for the test dataset to illustrate how it works.
Loading required data and lookup tables:
Deezer <- read.csv("/home/Deezer/10_Basic_Dataset/train.csv")
Deezer_test <- read.csv("/home/Deezer/10_Basic_Dataset/test.csv")
load("/home/Deezer/30_Wrangled_Data/tracks_cleaned.rda") #BPM column
load("/home/Deezer/30_Wrangled_Data/album_genre_clean.rda") #new_genre
load("/home/Deezer/30_Wrangled_Data/21_05_new_behavior_lookup.rda")
load("/home/Deezer/30_Wrangled_Data/23_05_SQRT_35_Clusters.rda")
load("/home/Deezer/30_05_genre_counts_lookup.rda")
load("/home/Deezer/30_05_genre_listened_lookup.rda")
load("/home/Deezer/30_05_genre_scaled_lookup.rda")
#These lookup tables are really big and take quite long to load into the environment. Therefore, we decided to comment these lines out.
#load("/home/Deezer/30_05_artist_listened_lookup.rda")
#load("/home/Deezer/30_05_artist_scaled_lookup.rda")
#load("/home/Deezer/30_05_artist_counts_lookup.rda")
The lookup tables already provide the user_ids as factors. However, the datatypes of the competition set have not been modified yet; all features are still numeric variables. We decided to merge on the numeric values instead of the factors, as the merge executes faster that way. Therefore, we first convert the user_id of the lookup tables back to integers.
behavior_molten$user_id <- as.integer(as.character(behavior_molten$user_id))
#class(behavior_molten$user_id)
cluster_user_lookup$user_id <- as.integer(as.character(cluster_user_lookup$user_id))
#new genre features
genre_counts.molten$user_id <- as.integer(as.character(genre_counts.molten$user_id))
genre_listened.molten$user_id <- as.integer(as.character(genre_listened.molten$user_id))
genre_scaled.molten$user_id <- as.integer(as.character(genre_scaled.molten$user_id))
#new artist features:
# artist_counts.molten$user_id <- as.integer(as.character(artist_counts.molten$user_id))
# artist_listened.molten$user_id <- as.integer(as.character(artist_listened.molten$user_id))
# artist_scaled.molten$user_id <- as.integer(as.character(artist_scaled.molten$user_id))
# artist_counts.molten$artist_id <- as.integer(as.character(artist_counts.molten$artist_id))
# artist_listened.molten$artist_id <- as.integer(as.character(artist_listened.molten$artist_id))
# artist_scaled.molten$artist_id <- as.integer(as.character(artist_scaled.molten$artist_id))
#Because we had problems removing sample_id, we create an NA column for the train data as well.
Deezer_test$is_listened = 0
Deezer$sample_id = NA
In the beginning of the project, when we had only small lookup tables to merge with the competition dataset, no serious problems occurred during the joins. However, as the lookup tables grew larger and larger (e.g. the artist table), the required computational power and time increased vastly. Therefore, we converted the lookup tables into data tables, which speeds up joining big tables considerably. Even so, the three new artist-based lookup tables created in the previous chapter were so big that we could not run any operations on them at all. Hence, we decided to exclude them from the model.
behavior_molten <- data.table(behavior_molten)
cluster_user_lookup <- data.table(cluster_user_lookup)
genre_counts.molten <- data.table(genre_counts.molten)
genre_listened.molten <- data.table(genre_listened.molten)
genre_scaled.molten <- data.table(genre_scaled.molten)
#artist_counts.molten <- data.table(artist_counts.molten)
#artist_listened.molten <- data.table(artist_listened.molten)
#artist_scaled.molten <- data.table(artist_scaled.molten)
class(behavior_molten)
## [1] "data.table" "data.frame"
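A minimal sketch (toy data, hypothetical values) of why the data.table conversion pays off: once a key is set, joins use a binary search on the sorted key instead of scanning the whole table.

```r
library(data.table)
lookup_dt <- data.table(user_id = c(1L, 1L, 2L),
                        alb.genre2_01 = c("Pop", "Rock", "Pop"),
                        behavior = c(2.1, 0.4, 1.3))
setkey(lookup_dt, user_id, alb.genre2_01)   # sorted index for fast keyed joins
main_dt <- data.table(user_id = c(1L, 2L, 2L),
                      alb.genre2_01 = c("Pop", "Pop", "Rock"))
merged <- merge(main_dt, lookup_dt,
                by = c("user_id", "alb.genre2_01"), all.x = TRUE, sort = FALSE)
merged  # user 2 never played "Rock", so that row's behavior value is NA
```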
We created a merging function which performs all joins for a given dataset. For the demonstration, we merge only the Deezer test dataset.
#dim(Deezer) #great for checking that number of rows stay constant while merging
dim(Deezer_test)
## [1] 19918 16
#joining data:
get_new_data <- function(x) {
x <- left_join(x, albums2, by = "album_id", all.x=T, sort=F) #alb.genre_2..
x <- left_join(x, tracks_clean2, by = "media_id",all.x=T, sort= F) # BPM
x <- merge(x, behavior_molten, by = c("user_id","alb.genre2_01"), all.x=T, sort=F) #old listening behavior
x <- left_join(x, cluster_user_lookup, by = "user_id",all.x=T, sort =F) #new clusters
timestamp = as.POSIXct(x$ts_listen, origin="1970-01-01")
splitdt <- data.frame(
hh = as.numeric(format(timestamp, format = "%H")), #24hours format
wd = as.numeric(format(timestamp, format = "%w"))) #weekday
x = cbind(x, splitdt) #cbind the extracted time to data
releaseDate = as.Date(as.character(x$release_date), format="%Y%m%d")
splitrdt <- data.frame(ryear = as.numeric(format(releaseDate, format = "%Y")))
x = cbind(x, splitrdt)
#new features (30.05.17)
x <- merge(x, genre_listened.molten, by = c("user_id","alb.genre2_01"), all.x=T, sort=F)
x <- merge(x, genre_scaled.molten, by = c("user_id","alb.genre2_01"), all.x=T, sort=F)
x <- merge(x, genre_counts.molten, by = c("user_id","alb.genre2_01"), all.x=T, sort=F)
# x <- merge(x, artist_counts.molten, by = c("user_id","artist_id"), all.x=T, sort=F)
# x <- merge(x, artist_scaled.molten, by = c("user_id","artist_id"), all.x=T, sort=F)
# x <- merge(x, artist_listened.molten, by = c("user_id","artist_id"), all.x=T, sort=F)
#remove old columns / other columns are deleted below:
x$genre_id = NULL #delete old genre column
x$ts_listen = NULL #remove ts_listen
x$release_date = NULL #remove release_date
return(x)
}
#DeezerNew_train <- get_new_data(Deezer)
DeezerNew_test <- get_new_data(Deezer_test)
#dim(DeezerNew_train)
dim(DeezerNew_test)
## [1] 19918 23
DeezerNew_test <- DeezerNew_test %>%
arrange(sample_id)
We can see that the function added 10 new features (23-16, plus the 3 excluded artist features), while the number of rows stays fixed. However, before we can use the data, we have to fix the datatypes. That is done by the next function, data_type_function.
#converting specific columns in factors
data_type_function <- function(x){
cols <- c("media_id", "album_id", "context_type", "platform_name", "platform_family", "listen_type", "user_gender", "user_id", "artist_id", "is_listened", "alb.genre2_01", "wd", "profile_id")
x[cols] <- lapply(x[cols], as.factor) #change the data type of these columns into factors
return(x)
}
#DeezerNew_train <- data_type_function(DeezerNew_train)
DeezerNew_test <- data_type_function(DeezerNew_test)
Furthermore, we made some additional adjustments, necessitated by merging the new behavior feature. In the test data we gained 4 NAs, which is hardly problematic as they are so few. In the train dataset, however, this effect occurs about 1500 times.
sum(is.na(DeezerNew_test$behavior))
## [1] 4
Checking the NAs in this specific column reveals the values which could not be joined. The encoding of the German letters ä and ü had caused the trouble. In the lookup table they are written like: "T
Deezer_fail <- DeezerNew_test %>%
filter(is.na(DeezerNew_test$behavior))
Deezer_fail %>%
group_by(alb.genre2_01) %>%
summarise (n = n())
## # A tibble: 2 × 2
## alb.genre2_01 n
## <fctr> <int>
## 1 Hörbücher 1
## 2 TürkischeVolksmusik 3
These genres are not that important for the modeling part. Also, they occur just 1500 times out of 7.5 million rows. Therefore, we decided to fill the NAs with a listening behavior of 0.
#DeezerNew_train$behavior[is.na(DeezerNew_train$behavior)] <- 0
DeezerNew_test$behavior[is.na(DeezerNew_test$behavior)] <- 0
Towards the end of our project we also added new time features to our models: the hour of the day is encoded as a circular value, so that late-evening and early-morning hours end up close to each other.
## Add time as circular value
#DeezerNew_train$xhh <- sin(2 * pi * DeezerNew_train$hh / 24)
#DeezerNew_train$yhh <- cos(2 * pi * DeezerNew_train$hh / 24)
DeezerNew_test$xhh <- sin(2 * pi * DeezerNew_test$hh / 24)
DeezerNew_test$yhh <- cos(2 * pi * DeezerNew_test$hh / 24)
## Delete the hh column
#DeezerNew_train <- subset(DeezerNew_train, select = -hh)
DeezerNew_test <- subset(DeezerNew_test, select = -hh)
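A quick toy check of the circular encoding above: with sin/cos coordinates, 23:00 and 00:00 end up close together, whereas the raw hour value would put them 23 apart.

```r
enc <- function(hh) c(sin(2 * pi * hh / 24), cos(2 * pi * hh / 24))
d_circ <- sqrt(sum((enc(23) - enc(0))^2))  # distance in the circular embedding
d_raw  <- abs(23 - 0)                      # distance on the raw hour scale
round(d_circ, 2)  # about 0.26
d_raw             # 23
```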
In the end, both files are combined. This was needed for one-hot encoding and creating the corresponding sparse matrix. Afterwards, all three new datasets are saved.
## Combine the files
#DeezerNew_traintest <- rbind(DeezerNew_train,DeezerNew_test)
## Save the new files
#save(DeezerNew_train, file = "/home/Deezer/30_Wrangled_Data/Deezer_train_0702.rda")
#save(DeezerNew_test, file = "/home/Deezer/30_Wrangled_Data/Deezer_test_0702.rda")
#save(DeezerNew_traintest, file = "/home/Deezer/30_Wrangled_Data/Deezer_traintest_0702.rda")
rm(list=ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1931160 103.2 6393998 341.5 7992498 426.9
## Vcells 6259405 47.8 980875092 7483.5 1225497955 9349.9
Data scientists spend most of their time on cleaning, labeling and organizing data (link). Fortunately, this task is now done. With a solid base of available data, the time has come to build and train our models.
This competition is a classification problem. Therefore, there are several classification algorithms to choose from, e.g. logistic regression, decision trees, boosted trees, k-nearest neighbors, singular value decomposition (SVD) and many more.
Gradient boosted trees are extremely powerful and win Kaggle competitions on a regular basis; as Kaggle grandmaster Ben Gorman put it, “If linear regression was a Toyota Camry, then gradient boosting would be a UH-60 Blackhawk Helicopter” (link). We therefore immediately decided to go with the gradient boosted tree approach. The XGboost library is particularly popular because it has been part of more than half of all winning Kaggle solutions (link). Therefore, we decided to focus on it as our main library.
Furthermore, we also tried a Naive Bayes and a Deep Learning approach. Libraries to build such models include h2o, lightgbm and naivebayes. In the end, though, our best submitted model was based on XGboost. Let us take you through the model building process in the following sub-chapters.
As a general note, we usually tested our models in an RMD file in RStudio on the Zeno server. Once everything worked, we put the code into a regular R file (R script) and ran it via Bash and the screen command. This allowed the code to run for several days while the terminal was closed.
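As a sketch, this Bash workflow could look roughly like the following (the session name and script path are hypothetical):

```shell
screen -S xgb_run               # start a named screen session
Rscript 20_Scripts/model.R      # run the R script inside the session
# Detach with Ctrl-A D; the job keeps running after the terminal is closed.
screen -r xgb_run               # later: reattach to check the progress
```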
XGboost only handles numeric values, but our dataset contains a lot of categorical values. There are two major methods to deal with this issue. One method is called label encoding: categorical values, e.g. male, female and neither, are simply assigned numbers such as 0, 1 and 2. This is then applied to all categorical values. See the table below.
## gender n
## 1 Male 0
## 2 Female 1
## 3 Neither 2
## 4 Trans Male 3
## 5 Trans Female 4
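As a small illustration, label encoding in base R can be done by converting a factor to integers (the toy vector below is made up). Note that as.integer() follows the alphabetical level order unless levels are set explicitly, so the assigned numbers may differ from the table above:

```r
## Toy label encoding: each factor level is mapped to an integer code.
gender <- factor(c("Male", "Female", "Neither", "Female"))
## Alphabetical level order: Female = 0, Male = 1, Neither = 2
gender_encoded <- as.integer(gender) - 1L
gender_encoded   # 1 0 2 0
```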
Surprisingly, XGboost deals with label-encoded values very well. However, we felt that this was not the most appropriate method from a statistical perspective. XGboost does not know that the data was transformed from a factor to a numeric value, so label encoding implies an ordering, i.e. that “Female” is larger than “Male” simply because it was assigned a larger number. Nevertheless, as chapter 4(b) will show, this method also achieved a high accuracy.
Therefore, we decided to use another method, called one-hot encoding. Another aspect that added weight to this method was that we had touched on this topic when we talked about regularization. What follows is a description of how we built our very first model and, finally, the evolution towards our best XGboost sparse model, which also happened to be our best overall model. The initial model building process is based on a tutorial (link) that is included in the official XGboost documentation.
We also reviewed the parameters in the official documentation (link) and on Analytics Vidhya (link) which offers a nice guide that explains the parameters in more detail.
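A toy version of one-hot encoding, built with the same sparse.model.matrix() function we use for the full dataset below (the tiny data frame is made up):

```r
library(Matrix)  # ships with base R distributions

toy <- data.frame(gender = factor(c("Male", "Female", "Neither")))
## ~ . - 1 drops the intercept so every factor level gets its own 0/1 column.
ohe <- sparse.model.matrix(~ . - 1, data = toy)
dim(ohe)        # 3 observations x 3 binary columns
colnames(ohe)   # one column per level, e.g. "genderFemale"
```

Each row contains exactly one 1, so no artificial ordering between the levels is implied.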
In this very first model we did not yet include all the features we developed over the course of the project, because at that time we were still working on the feature engineering part. Thus, this model only included the BPM and genre features as additions.
## 0. Load the data.
load("/home/Deezer/30_Wrangled_Data/Archiv/Deezer_new.rda")
## 1. Wrap full_dataset with data.table because its performance for large datasets is best in class.
df <- data.table(DeezerNew_all, keep.rownames = F)
summary(df) # Check for any NAs because XGboost can also not deal with those.
## media_id album_id context_type platform_name
## 132434634: 20162 14038232: 127512 0 :3198365 0:5095854
## 133165774: 17683 14530576: 64215 1 :1633820 1:1380179
## 132123604: 16322 13246617: 46438 2 :1052844 2:1102719
## 132123626: 15165 14139248: 38757 3 : 433456
## 132123630: 14370 12870014: 29878 4 : 230259
## 129632340: 13088 14466676: 29594 5 : 170740
## (Other) :7481962 (Other) :7242358 (Other): 859268
## platform_family media_duration listen_type user_gender
## 0:6058802 Min. : 0.0 0:5239224 0:4594298
## 1:1102719 1st Qu.: 196.0 1:2339528 1:2984454
## 2: 417231 Median : 222.0
## Mean : 231.2
## 3rd Qu.: 254.0
## Max. :65535.0
##
## user_id artist_id user_age is_listened
## 0 : 6318 1519461: 163792 Min. :18.00 0:2408260
## 1 : 6241 1191615: 161759 1st Qu.:21.00 1:5170492
## 2 : 6212 388973: 103822 Median :25.00
## 3 : 5582 390: 75694 Mean :24.31
## 4 : 5427 246791: 50069 3rd Qu.:28.00
## 5 : 5401 13011: 48379 Max. :30.00
## (Other):7543571 (Other) :6975237
## alb.genre2_01 track_bpm month hh
## Rap/HipHop :2563755 Min. : 49.3 11 :7456988 18 : 563039
## Pop :1897947 1st Qu.:107.0 12 : 81095 17 : 546202
## Alternative: 839671 Median :124.0 10 : 38894 19 : 488080
## Electro : 641101 Mean :127.2 1 : 840 16 : 482949
## Dance : 527809 3rd Qu.:141.0 9 : 409 15 : 453394
## Rock : 283019 Max. :268.5 8 : 388 11 : 434923
## (Other) : 825450 (Other): 138 (Other):4610165
## wd ryear
## 0: 878015 Min. :1900
## 1: 930328 1st Qu.:2009
## 2:1219571 Median :2014
## 3:1201765 Mean :2011
## 4:1178697 3rd Qu.:2016
## 5:1123252 Max. :3000
## 6:1047124
## 2. Create Sparse Matrix. We leave "is_listened" out because we want to predict it (is_listened ~. means all categorical features are converted to binary values except the "is_listened" column). Unfortunately, this will take around 5 minutes on the server.
sparse_matrix <- sparse.model.matrix(is_listened ~.-1, data = df)
dim(sparse_matrix) # As you can see, this is a gigantic sparse matrix.
## [1] 7578752 692189
## 2.1. Split the Sparse Matrix up according to the train (7558834 rows) and test (remaining rows) data. We will limit the number of rows for demonstration purposes. In a real setting the rows would not be restricted.
sparse_train <- sparse_matrix[1:50000,] # sparse_matrix[1:7558834,] in our real code
sparse_test <- sparse_matrix[50001:60000,] # sparse_matrix[7558835:7578752,] in our real code
dim(sparse_test)
## [1] 10000 692189
dim(sparse_train)
## [1] 50000 692189
## 3. Create the Boolean output vector for XGboost. This will be our label for the model. It generates a vector that is TRUE for every is_listened == 1 and FALSE otherwise.
output_vector <- DeezerNew_all[1:50000,"is_listened"] == 1 # DeezerNew_all[1:7558834,"is_listened"] == 1 in our real code.
## 4. Build the model. The parameters are chosen based on what we believed was "right" after reading through the official documentation.
bst <- xgboost(data = sparse_train, # Data source to train the model.
label = output_vector, # Response variable.
max.depth = 6, # Max depth of the tree.
eta = 0.2, # Learning rate, describes the weight of a newly built tree.
verbose = 1, # Information output.
nrounds = 10, # Number of rounds. We limit the number of training rounds to 10 for demonstration purposes.
nthread = 32, # Number of CPUs.
early_stopping_rounds = 10, # Number of rounds when the program should stop training without seeing an improvement.
eval_metric = "auc", # Evaluation metric. Since the Evaluation metric is AUC in the competition, we chose this one.
objective = "binary:logistic") # Definition of the loss function that shall be minimized. This is a probabilistic binary classification problem (values ranging from 0 to 1). Therefore, we chose "binary:logistic".
## [1] train-auc:0.656732
## Will train until train_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.662560
## [3] train-auc:0.669729
## [4] train-auc:0.670759
## [5] train-auc:0.675526
## [6] train-auc:0.677984
## [7] train-auc:0.680857
## [8] train-auc:0.683110
## [9] train-auc:0.686795
## [10] train-auc:0.688798
## 4.1. Fit the model.
XGpred <- predict(bst, sparse_test)
## 5. Create a submission file.
submission <- read.csv("/home/Deezer/10_Basic_Dataset/sample_submission_kaggle.csv")
submission$is_listened[1:10000] <- XGpred
#write.csv(submission, "OHE_xgb300_May17_wo_profileID.csv", row.names = FALSE)
This was our first XGboost sparse model. It achieved an AUC score of 0.58639 on the public leaderboard and 0.59343 on the private leaderboard. That is already a decent first result, but still far from the top; at that time, the best team had a score of around 0.65.
Our main considerations for improvement were hyperparameter tuning (HT) and cross-validation (CV).
The following code describes our best model. Note that we only restricted the number of rounds and iterations in the HT+CV and fitting phases so that you can follow what is happening. Since the sparse matrix is huge, please allow the script about 7 minutes to prepare the data, and make sure you run it on the server.
We start with HT and CV.
## 1. Load only the train data.
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
summary(DeezerNew_train_0525) # Check for any NAs because XGboost can also not deal with those.
## user_id alb.genre2_01 media_id
## 0 : 6317 Rap/HipHop :2556327 132434634: 20148
## 1 : 6240 Pop :1893072 133165774: 17652
## 2 : 6211 Alternative: 837705 132123604: 16289
## 3 : 5581 Electro : 639527 132123626: 15131
## 4 : 5426 Dance : 526270 132123630: 14347
## 5 : 5400 Rock : 282338 129632340: 13076
## (Other):7523659 (Other) : 823595 (Other) :7462191
## album_id context_type platform_name platform_family
## 14038232: 127230 0 :3198365 0:5083338 0:6041392
## 14530576: 64145 1 :1617653 1:1374227 1:1101269
## 13246617: 46307 2 :1052844 2:1101269 2: 416173
## 14139248: 38671 3 : 433456
## 12870014: 29738 4 : 230259
## 14466676: 29536 5 : 167428
## (Other) :7223207 (Other): 858829
## media_duration listen_type user_gender artist_id
## Min. : 0.0 0:5239223 0:4583009 1519461: 163420
## 1st Qu.: 196.0 1:2319611 1:2975825 1191615: 161416
## Median : 222.0 388973: 103643
## Mean : 231.2 390: 75457
## 3rd Qu.: 254.0 246791: 49879
## Max. :65535.0 13011: 48188
## (Other) :6956831
## user_age is_listened sample_id track_bpm
## Min. :18.00 0:2388342 Mode:logical Min. : 49.3
## 1st Qu.:21.00 1:5170492 NA's:7558834 1st Qu.:107.0
## Median :25.00 Median :124.0
## Mean :24.31 Mean :127.2
## 3rd Qu.:28.00 3rd Qu.:141.0
## Max. :30.00 Max. :268.5
##
## behavior profile_id hh wd
## Min. :0.000 3 :2034837 18 : 561307 0: 875686
## 1st Qu.:1.450 5 :1355185 17 : 544683 1: 928224
## Median :2.913 6 :1213022 19 : 486507 2:1216865
## Mean :2.972 8 : 489882 16 : 481687 3:1198530
## 3rd Qu.:4.357 4 : 287900 15 : 452299 4:1173973
## Max. :7.817 13 : 271744 11 : 433921 5:1120992
## (Other):1906264 (Other):4598430 6:1044564
## ryear
## Min. :1900
## 1st Qu.:2009
## Median :2014
## Mean :2011
## 3rd Qu.:2016
## Max. :3000
##
## 2. We wanted to use the full dataset to get the most out of the available information. We first needed to delete the "sample_id" column because it is empty.
DeezerNew_train_0525 <- subset(DeezerNew_train_0525, select = -sample_id)
## 3. Since this is a computationally expensive task, we run our HT and CV on 1/3 of the data.
DeezerSampleCV <- DeezerNew_train_0525[sample(nrow(DeezerNew_train_0525), nrow(DeezerNew_train_0525) * .33), ]
## 4. Wrap the sample in a data.table for faster computation.
Deezer_DT <- data.table(DeezerSampleCV, keep.rownames = FALSE)
## 5. Create the sparse matrix (leave is_listened out because we want to predict it). -1 drops the intercept column (a column filled with 1s) that this function would otherwise generate.
sparse_train <- sparse.model.matrix(is_listened ~.-1, data = Deezer_DT)
dim(sparse_train) # Check the dimensions.
## [1] 2494415 691695
## 6. Create the Boolean output vector for XGboost. This will be our label for the model. It generates a vector that is TRUE for every is_listened == 1 and FALSE otherwise.
output_vector <- DeezerSampleCV[,"is_listened"] == 1
Now that we have set up the data we can run the HT and CV. Grid search and manual search are two of the most commonly used strategies for hyper-parameter tuning. Bergstra and Bengio (2012) (link) show empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter tuning than trials on a grid. Therefore, we decided to use random values in our grid.
## 1. We initialize a list and some variables.
best_param <- list() # Create an empty list to store our results.
best_seednumber <- 1234 # Set an initial seed for reproducibility.
best_logloss <- Inf
best_logloss_index <- 0
## 2. We then initialize a for loop that runs through several parameter combinations.
for (iter in 1:4) { # for (iter in 1:250) in our real code
## Our parameter list. This is the HT part.
param <- list(objective = "binary:logistic", # Definition of the loss function.
eval_metric = "logloss", # Evaluation metric.
max_depth = sample(6:12, 1), # Max depth of the trees between 6 and 12.
eta = runif(1, .01, .3), # Learning rate between 0.01 and 0.3.
gamma = runif(1, 0.0, 0.2), # Specifies the minimum loss reduction required to make a split in a node. Higher values make the algorithm more conservative.
subsample = runif(1, .6, .9), # Denotes the fraction of observations to be random samples for each tree.
colsample_bytree = runif(1, .5, .8), # Denotes the fraction of columns to be randomly chosen. 1 means to choose all columns.
min_child_weight = sample(1:40, 1), # Defines the minimum sum of weights of all observations required in a child. Higher values prevent the model from over-fitting.
max_delta_step = sample(1:10, 1) # Introduces an "absolute" regularization, capping the weight before eta is applied. Higher values make the algorithm more conservative.
)
## Additional information which we need for the CV. This is the CV part.
cv_nround <- 5 # Set the number of training rounds. In our real code this was set to 300.
cv_nfold <- 5 # Set the number of folds. 5 is a standard value.
seed_number = sample.int(10000, 1)[[1]] # Choose a random number.
set.seed(seed_number) # Set a seed for reproducibility.
message("Iteration Round: ", as.character(iter), appendLF = FALSE) # Check at which iteration we are.
## The actual CV model that uses a HT combination from above.
mdcv <- xgb.cv(data = sparse_train,
label = output_vector,
params = param,
nfold = cv_nfold,
nrounds = cv_nround,
nthread = 32,
verbose = TRUE,
early_stopping_rounds = 20,
maximize = FALSE
)
## We would like to store the results of our HT and CV.
min_logloss <- min(mdcv$evaluation_log$test_logloss_mean)
min_logloss_index <- which.min(mdcv$evaluation_log$test_logloss_mean)
## If the result was better than the ones we saw before, we will save it in the list we created in the beginning.
if (min_logloss < best_logloss) {
best_logloss = min_logloss
best_logloss_index = min_logloss_index
best_seednumber = seed_number
best_param = param
}
}
## Iteration Round: 1
## [1] train-logloss:0.621988+0.009152 test-logloss:0.622210+0.009166
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 20 rounds.
##
## [2] train-logloss:0.580552+0.004552 test-logloss:0.581014+0.004784
## [3] train-logloss:0.556074+0.003271 test-logloss:0.556763+0.003478
## [4] train-logloss:0.540744+0.003123 test-logloss:0.541659+0.003187
## [5] train-logloss:0.532734+0.003270 test-logloss:0.533777+0.003370
## Iteration Round: 2
## [1] train-logloss:0.656056+0.002465 test-logloss:0.656093+0.002337
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 20 rounds.
##
## [2] train-logloss:0.629194+0.001385 test-logloss:0.629240+0.001199
## [3] train-logloss:0.612203+0.002734 test-logloss:0.612272+0.002896
## [4] train-logloss:0.595135+0.002558 test-logloss:0.595219+0.002663
## [5] train-logloss:0.581699+0.002389 test-logloss:0.581811+0.002752
## Iteration Round: 3
## [1] train-logloss:0.624687+0.003241 test-logloss:0.624958+0.003119
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 20 rounds.
##
## [2] train-logloss:0.587113+0.004482 test-logloss:0.587566+0.004361
## [3] train-logloss:0.562542+0.002089 test-logloss:0.563165+0.001880
## [4] train-logloss:0.548093+0.003765 test-logloss:0.548879+0.003473
## [5] train-logloss:0.536273+0.003903 test-logloss:0.537293+0.003619
## Iteration Round: 4
## [1] train-logloss:0.627545+0.010758 test-logloss:0.627661+0.010566
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 20 rounds.
##
## [2] train-logloss:0.590365+0.012340 test-logloss:0.590602+0.012109
## [3] train-logloss:0.564107+0.009817 test-logloss:0.564460+0.009520
## [4] train-logloss:0.550047+0.005461 test-logloss:0.550482+0.005232
## [5] train-logloss:0.541821+0.006418 test-logloss:0.542366+0.006075
## Finally, we save the best parameters in a CSV file to add them to our final model later. Alternatively, we could have just added those parameters directly to the parameter settings of the final model.
#write.csv(best_param, file = "best_parameters_FinalCV.csv")
It is time for the final model building part. We apply our best settings and train the model on the full dataset. Once again, we start with a little bit of data preparation.
## 0. Just to make sure we have the correct source data, we load it once again. These are the train, test and combined datasets.
load("/home/Deezer/30_Wrangled_Data/Deezer_train_0525.rda")
load("/home/Deezer/30_Wrangled_Data/Deezer_test_0525.rda")
load("/home/Deezer/30_Wrangled_Data/Deezer_traintest_0525.rda")
## 1. In each dataset we delete the "sample_id".
DeezerNew_traintest_0525 <- subset(DeezerNew_traintest_0525, select = -sample_id)
DeezerNew_train_0525 <- subset(DeezerNew_train_0525, select = -sample_id)
DeezerNew_test_0525 <- subset(DeezerNew_test_0525, select = -sample_id)
## 2. This time we wrap the combined dataset in a data.table for faster computation.
Deezer_DT <- data.table(DeezerNew_traintest_0525, keep.rownames = F)
## 3. Once again, we create a sparse matrix (leave is_listened out because we want to predict it). -1 drops the intercept column (a column filled with 1s) that this function would otherwise generate.
sparse_matrix <- sparse.model.matrix(is_listened ~.-1, data = Deezer_DT)
dim(sparse_matrix)
## [1] 7578752 692213
## 3.1. Split the Sparse Matrix up according to the train (7558834 rows) and test (remaining rows) data.
sparse_train <- sparse_matrix[1:nrow(DeezerNew_train_0525),]
sparse_test <- sparse_matrix[7558835:nrow(DeezerNew_traintest_0525),]
dim(sparse_test)
## [1] 19918 692213
dim(sparse_train)
## [1] 7558834 692213
## 4. Create the Boolean output vector for XGboost. This will be our label for the model. It generates a vector that is TRUE for every is_listened == 1 and FALSE otherwise.
output_vector <- DeezerNew_train_0525[,"is_listened"] == 1
Now we set up the model and add our best parameters. During the competition we added those parameters manually from the CSV file because we always ran a normal R script on the Zeno server via bash and the “screen” command.
## We already added the best model parameters.
param <- list(objective = "binary:logistic",
eval_metric = "logloss",
max_depth = 10,
eta = .211,
gamma = .0509,
subsample = .65,
colsample_bytree = .534,
min_child_weight = 2,
max_delta_step = 10
)
## Model Input
XGB_300_fulldataset_0526 <- xgboost(params = param,
data = sparse_train,
label = output_vector,
verbose = TRUE,
nrounds = 2, #in our real code, this was set to 300
nthread = 32,
early_stopping_rounds = 20,
maximize = FALSE
)
## [22:16:31] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
## [1] train-logloss:0.657479
## Will train until train_logloss hasn't improved in 20 rounds.
##
## [2] train-logloss:0.616959
XGB_300_fulldataset_0526_Pred <- predict(XGB_300_fulldataset_0526, sparse_test)
## Create submission file
submission <- read.csv("/home/Deezer/10_Basic_Dataset/sample_submission_kaggle.csv")
submission$is_listened <- XGB_300_fulldataset_0526_Pred
#write.csv(submission,"XGB_300_fulldataset_0526_Pred",row.names = FALSE)
Let’s have a look at the most important features. Note that XGboost allows you to save trained models. We saved a model that has the same input as above but was trained for only 100 rounds; the difference between the 300- and the 100-round model should be relatively small. For convenience we load the 100-round model and plot the feature importance for it.
bst <- xgb.load("/home/Deezer/20_Scripts/20_Scripts_for_Modelling/30_XGB_Trained_Models/XGB_100_fulldataset_0527.model")
prediction <- predict(bst, sparse_test)
mat <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
print(mat)
## Feature Gain Cover Frequency
## 1: behavior 3.009967e-01 6.986890e-02 5.626473e-02
## 2: profile_id 5 1.971010e-01 1.283766e-02 3.731343e-03
## 3: listen_type1 4.851180e-02 7.215010e-03 1.463079e-02
## 4: platform_name2 4.190538e-02 8.869195e-03 9.131972e-03
## 5: platform_family1 3.235921e-02 5.710806e-03 5.449725e-03
## ---
## 3368: album_id 401340 2.133687e-08 5.118559e-09 4.909662e-05
## 3369: artist_id10458629 2.114077e-08 8.759004e-09 4.909662e-05
## 3370: artist_id 807296 1.801582e-08 1.187721e-07 4.909662e-05
## 3371: user_id15792 1.688117e-08 1.381899e-08 4.909662e-05
## 3372: user_id 978 1.594599e-08 4.511726e-06 4.909662e-05
xgb.plot.importance(xlab = "Contribution", ylab = "Feature", importance_matrix = mat[1:20])
## Relative Importance
xgb.plot.importance(mat[1:20], rel_to_first = TRUE, xlab = "Relative importance")
ggplot(mat[1:20,], aes(reorder(Feature, Gain), Gain)) +
geom_col(fill = "lightblue") +
coord_flip() +
labs(x = "Feature", y = "Contribution", title = "Feature Importance", caption = "Source: XGboost Trained Model")
As you can see, our new features “profile_id” and “behavior” had quite an impact. This model was our best one. It achieved an AUC score of 0.63749 (up from 0.58639) on the public leaderboard and 0.63860 (up from 0.59343) on the private leaderboard. That is an improvement of around 8.7% and 7.6%, respectively.
In terms of accuracy, the XGboost sparse model gave us the best private leaderboard score, but it was computationally expensive and time-consuming because of the sparse matrix and the hyper-parameter search. Other models, such as XGboost (label encoded), LightGBM (super fast) and H2O AutoML (extremely fast, i.e. only 90 minutes), were computationally cheap, and we were able to run them locally without the server. Their accuracy was only slightly lower than that of the XGboost sparse model.
Caret is a package for feature selection, model tuning and variable importance estimation (see https://topepo.github.io/caret/index.html) and contains 233 different models. It makes it easy to compare different models, but uses a lot of memory to do so. We only discovered it shortly before the deadline and, because of the memory consumption, interrupted the processing to focus entirely on the last cross-validation runs. In general, Caret is worth keeping in mind.
While testing the package on a small subset, the following error occurred after a while: “At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names”. However, when we applied make.names() as suggested, its output was identical to the column names.
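The message can be reproduced directly: make.names() shows what caret does to factor levels that start with a digit. A minimal sketch of the suggested fix (the response vector y here is hypothetical):

```r
## Levels "0"/"1" are not valid R variable names, so they get an "X" prefix.
make.names(c("0", "1"))
## [1] "X0" "X1"

## Suggested fix: rename the factor levels before training.
y <- factor(c(0, 1, 1, 0))
levels(y) <- make.names(levels(y))
levels(y)
## [1] "X0" "X1"
```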
Just out of curiosity we tried a little competition: SAP Predictive Analysis vs. R. Based on the lectures of Nancy Jones (San Diego State University) we explored the possibilities of SAP PA. We had hoped to be able to import our R models and test them with automatic visualization. Unfortunately, that is not possible. Therefore, we had to run the complete model creation in SAP PA.
The dataset used was DeezerNew_train_0530.csv with 7,558,834 rows and 25 features. The learning phase took 76 minutes in total. The prediction engine chosen by SAP Predictive Analysis was Kxen.RobustRegression. In the end it kept just 10 features, of which profile_id contributed the most. The accuracy of the predictions, however, was not high: it ended up with a low score of 0.4798. Because so few influence factors remained, there was no possibility to improve the score.
SAP Predictive Analysis can be an option for managers with little data science background. It is really convenient that a complete summary page (as HTML, in the Models folder) is prepared to take into meetings, plus one summary slide to put into a presentation. With R the possibilities are nearly endless, but you have to know what to do and create the figures and numbers on your own. The advantage is that you can achieve a better prediction power.
Below are the statistics of this competition.
[screenshot: competition statistics]
We were quite flexible within our group and everyone was encouraged to try and test new features in the feature engineering part. Max and Vera applied advanced statistical concepts and came up with new features which significantly improved the model accuracy. Pranav and Christian focused on building models and improving their accuracy through parameter tuning as well as trying and testing custom and relative features with basic feature engineering. Along with supporting tasks, Dennis also provided expert solutions to the technical issues we faced, especially with the Zeno server. Overall, we discussed our results in biweekly meetings. This helped a lot with keeping the other group members up to date, questioning approaches and brainstorming new ideas.
In detail, the tasks were divided as follows:
Chris: Christian mainly focused on the sparse XGboost approach. He converted dense dataframes into sparse matrices, designed the model and implemented hyperparameter tuning in combination with cross-validation to improve the model’s accuracy.
Dennis: While the subgroups for feature engineering and modelling were already formed and working efficiently, he concentrated on supporting and independent tasks. He tried Naive Bayes to find out whether the trade-off between speed and accuracy makes sense. Furthermore, he researched Caret for feature and model selection and compared SAP Predictive Analysis with the group's results.
Max: Max focused on feature engineering and teamed up with Vera a lot of the time, since coding as a duo is, as the group found out for themselves, really beneficial for productive working time. Max came up with the idea for the “behavior” features and coded the clustering and the genre-id cleaning.
Pranav: In the modeling part, Pranav focused on finding and applying the most suitable modeling methods, so we ended up with fairly recent and well-known algorithms such as XGboost, LightGBM and H2O (including the recent H2O automated machine learning approach). These algorithms offered very good accuracy with very fast computation, specifically H2O and LightGBM. Most of these models were run on a local system (laptop) in case the server was occupied. He was also responsible for applying various parameter-tuning approaches for the above-mentioned models to reduce the test error. In the feature engineering part, Pranav applied binning together with basic feature extraction, such as date and time features from the timestamp and language detection for songs and albums, and then binned them into groups to further improve the model's accuracy.
Vera: She was mainly involved in creating new features. While most ideas came up in brainstorming meetings, this task was more about providing the additional information as cleaned and correct data. For example, one specific task was to create a new genre column, as the data analysis had shown that the genre information in the competition dataset was corrupt. This task was mainly solved together with Max. The data could not simply be obtained by querying the API; it also had to be cleaned by finding new ways to fill the data gaps. Furthermore, Vera was responsible for merging all new feature tables with the provided competition dataset. She also provided data descriptions to ensure a common understanding of the datasets.
When we performed HT and CV our models improved significantly. That was one of the key takeaways for us.
Creating additional features sometimes only adds noise to the dataset and also takes time to develop. We think we spent too much time on additional features that we did not use at all in the end. Perhaps that time could have been spent on further data exploration.
Computational costs
Insufficient computational power (gigantic sparse matrix; shared server; other groups; Prof. Löcher himself; R mostly runs on one core)
Missing domain knowledge of music and listening behavior
Problems with RStudio Web and kick-outs; different accounts, however, come with the problem of different workspaces
#### Specifications of machine used for most XGboost, H2O and LightGBM models

| Operating System | System Model | Processor | Memory |
|---|---|---|---|
| Windows 10 Home 64-bit | Dell Inspiron 5567 | Intel Core i7-7500U CPU @ 2.70GHz (4 CPUs), ~2.9GHz | 16 GB RAM |
#### Specifications of machine used for XGboost sparse models